
What ChatGPT and Claude Can See on Your Screen


Pasting screenshots into ChatGPT, and now Claude, has become a regular practice for me. As we first saw in my post about an LLM-backed Datasette plugin, the ability of LLMs to read text in images has torn down the barrier that once separated data from pictures of data. That’s a big deal, but their screen-reading power extends far beyond just reading text.

In this case, while debugging SQL, I found it easier to provide LLMs with screenshots of Postgres output than to copy text from the terminal.

Here I pasted a screenshot of a Python debugging session.

That picture is worth quite a few words. It says that we’re operating in the context of Python’s Google API client and that we’ve authenticated to the service with some kind of valid credential, but the document ID is wrong or a necessary scope wasn’t granted (or requested by the app), or perhaps there’s a different problem. Using words to transmit that context to another person would be tedious — that’s why we screenshare. Using the same words to prompt a language model would be just as tedious. The ability to show rather than tell is a gamechanger.
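To make that concrete, here is a minimal sketch, with a hypothetical token file and document ID, of the kind of failure the screenshot showed: the credential is valid, but the call fails anyway because the document ID is wrong or the app never obtained a scope that covers it.

    from google.oauth2.credentials import Credentials
    from googleapiclient.discovery import build
    from googleapiclient.errors import HttpError

    # Hypothetical stored credential; assumes an OAuth flow already happened.
    creds = Credentials.from_authorized_user_file("token.json")
    docs = build("docs", "v1", credentials=creds)

    try:
        doc = docs.documents().get(documentId="WRONG_OR_UNSHARED_ID").execute()
    except HttpError as err:
        # 404 points to a bad document ID; 403 points to a missing or ungranted scope.
        print(err.resp.status, err)

A screenshot of that traceback carries all of this context in a single paste.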

In this example, ChatGPT guided me as I learned how to plot equations in GeoGebra, a tool that I’d only just started to use.

Think about what’s happening here. ChatGPT’s world knowledge includes the GeoGebra documentation, so it can offer general guidance about how to navigate the tool’s interface. But with screenshots that show how my plotting efforts are failing, it can also give me specific guidance about how to use the tool in a given task context.

A Tragic Flaw of Modern Web Applications

Every day, interactions like this occur countless times:

Customer: Where is my payment history?

Support rep: Do you see the menu on the top right of the screen? The thing with three horizontal lines? Click that, and then look for…

What’s tragic? Web software could enable this instead:

Customer: Where is my payment history?

Support rep: [pastes link] Use this link.

The notion of hypertext as the engine of application state was implicit in the web from the very beginning. A decade later Roy Fielding famously made it explicit in his dissertation. Early web applications reflected application state in their URLs far more comprehensively than modern ones typically do. In the dominant SPA (single-page application) style, that might not happen at all — which, to put it bluntly, sucks.

I would rather reserve the ecological cost of LLMs for other things; web software could be far more easily navigable and teachable on its own. But it mostly isn’t, that’s a huge problem, and LLMs can help.

Navigating Cloud Consoles

I have recently been writing the kind of documentation that I wish didn’t have to exist. As prerequisites for tutorials on Turbot Guardrails, I needed to explain how to create custom roles in AWS, Azure and GCP. The tutorials walk you through these procedures step by step, showing each screen along the way, with red ovals and arrows to circle or point to key screen elements. Although Guardrails is Turbot’s bread-and-butter product, I’ve mainly focused on the open source suite; I was rusty at some of these procedures, and others were completely new to me.

Let’s look at how sharing screens with LLMs helped me get up to speed.

Using CloudFormation to Create a Role

One procedure entails creating a CloudFormation stack that will in turn create an IAM role. When I got here, I wasn’t sure what to do.

I knew that I needed to upload the template I’d been given. But was I working with new or existing resources? I could see it either way. So I pasted the screenshot into ChatGPT and Claude along with this prompt.

I’ve been given a CloudFormation template that will create a role. I am here. What do I do next?

Both assistants confirmed that I should choose the default: new resources. And both spelled out the sequence of next steps. But it was easier to just share my screen at each step along the way.

When I got here, I wanted to confirm I was doing the right thing.

My prompt:

I am here. The purpose of this stack is to create a role. Does this look correct?

ChatGPT’s reassuring answer.

Claude, however, got confused.

I was pretty sure that was wrong, and (per rule #4 from Best Practices for Working With Large Language Models, “ask for choral explanations”) I relayed Claude’s response to ChatGPT, which set me straight.

I might easily have shared Claude’s confusion about the distinction between permissions needed to create the stack and permissions granted by the role the stack would create. Its error wasn’t a showstopper. Rather, it fed into a dialectic that helped me solidify my understanding of the process.
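If you would rather script that step than click through the console, here is a minimal sketch with boto3; the stack name and template file name are hypothetical, and the Capabilities line is the standard acknowledgment required whenever a stack creates IAM resources.

    import boto3

    cfn = boto3.client("cloudformation")  # assumes AWS credentials are configured

    # Hypothetical file name for the template I'd been given.
    with open("guardrails-role.yml") as f:
        template_body = f.read()

    cfn.create_stack(
        StackName="guardrails-role",            # hypothetical stack name
        TemplateBody=template_body,
        # Acknowledge that this stack creates IAM resources.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )

    # Block until the stack, and therefore the role it defines, exists.
    cfn.get_waiter("stack_create_complete").wait(StackName="guardrails-role")

The console’s “with new resources” choice corresponds to this standard create; the “existing resources” option is for importing resources already in your account, which isn’t what’s happening here.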

Creating a GCP Custom Role

When I got here, I had questions.

My prompt:

What do the asterisks mean? Should I overwrite Title and ID? Do I care about Role launch stage?

Again, ChatGPT was more helpful, confirming my hunch that I did want to overwrite both Title and ID, and that the default launch stage was OK for my purpose.

When I got here, I had more questions.

My prompt:

I am here. I need to add the permission to alter properties on buckets. How do I figure out what permission I need, and how do I find and add it?

Here too ChatGPT was more helpful. It suggested that storage.buckets.update was the permission I needed, and that I could type storage.buckets into the filter box to narrow down the list and make that a selectable option. But I had a follow-up question:

There are two filter boxes?

ChatGPT provided the guidance I was looking for.
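For comparison, here is a minimal sketch of the same custom role created through the IAM API with Python’s Google API client; the project ID, role ID, and title are hypothetical, and storage.buckets.update is the one permission the tutorial needed. Title, ID, launch stage, and the permission list on the console form map directly onto these fields.

    from googleapiclient.discovery import build

    # Assumes application-default credentials that can administer roles in the project.
    iam = build("iam", "v1")

    role = iam.projects().roles().create(
        parent="projects/my-gcp-project",          # hypothetical project ID
        body={
            "roleId": "bucketPropertyEditor",      # the ID field on the form
            "role": {
                "title": "Bucket Property Editor", # the Title field
                "stage": "GA",                     # the Role launch stage dropdown
                "includedPermissions": ["storage.buckets.update"],
            },
        },
    ).execute()

    print(role["name"])  # projects/my-gcp-project/roles/bucketPropertyEditor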

If you work with these cloud platforms every day, you have doubtless forgotten that you ever had questions like these. But every newcomer does. And on a continuing basis, we are all newcomers to various aspects of applications and services. In so many ways, the experience boils down to: I am here, what do I do now?

It’s nice if you can share your screen with someone who has walked that path before you, but that’s often impossible or infeasible. LLMs synthesize what others have learned walking the path. We typically use words to search that body of hard-won knowledge. Searching with images can be a powerful complementary mode.

In Defense of Claude

Although Claude fared poorly in these examples, I don’t want to leave you with the wrong impression. Its guidance for developers working with cloud platforms seems, for now, inferior to ChatGPT’s. But that isn’t because Claude can’t see the screens. It sees them perfectly well but doesn’t (yet) do a great job connecting them to associated task knowledge.

Here’s a nice example of how Claude’s ability to see screens helped me get a job done. Recently I discovered and fell in love with lite.cnn.com, one of a handful of text-only news sites that deliver the sort of fast and distraction-free experiences that you may have forgotten are still possible on the web. Each article at lite.cnn.com ends with links to See full web article and Go to the full CNN experience. Which is hilarious because, once you can actually just read the article, why would you want to dive back into that chamber of horrors?

After using lite.cnn.com for a while, though, I did find myself wanting the option to view images, and realized a browser extension could fetch them from the full site and make them available in the lite page. When Claude and I wrote an extension to do that, we wrestled with how to display the images. My first thought was to float shrunk versions to the right of the paragraphs in which they occurred. Layout was tricky, though. As we iterated through variations, I found that I didn’t have to describe the layouts the extension was producing; I could just screenshot them and paste the images into Claude’s input box. That was enough to prompt it to try a different layout strategy.

In the end I decided to do the Simplest Possible Thing and accumulate the images (with their captions) at the end of each lite article. That approach is appealing for two reasons. First, it is the Simplest Possible Thing. Second, it aligns with a distraction-free approach. We’re now conditioned to abhor the dreaded wall of text, but those images don’t always add value. By pushing them to the end of the article I’m choosing to focus on the text and exercise a reading muscle that has atrophied to an alarming degree.

That’s just a personal choice, though. If you feel differently, I encourage you to fork the repo and see what you and Claude and/or ChatGPT can come up with. I think you’ll find that showing them tweaked layouts as they evolve is a delightful alternative (or complement) to describing them in words.

