Prototyping AI with AI, or Replacing Wizard of Oz Prototyping with AI

Context

A while back, I used the Wizard of Oz methodology to test an AI experience for Staples.

For the project, the idea was that users would tell us about their desired print product via text or image input, and we would recommend a print product based on that input.

The process required me to make live edits in ProtoPie, Figma and Photoshop to generate the prototypes for users while they took the test. This is called Wizard of Oz prototyping. It worked after some practice, but it required a lot of prep time, a lot of work, and the hope that the person generating the prototypes (me) would not take too long or otherwise bungle the test.

We also found that users wanted to play around, "re-roll" their recommendation and customize their results, which we now know is half the fun of AI. At the time, however, prototyping dynamic results pages on the fly based on user input would have made the test far less manageable (if doable at all), even with the Wizard of Oz methodology.

Now, with Figma Make, Claude and other agentic tools, we can not only generate different results pages quickly, we can also potentially vibe-code the entire experience with no need for Wizard of Oz.

I've been on maternity leave for a little while, and I've played around with AI programs a bit, but not enough to try them out on an entire project. I took this as an opportunity to do that. Let's prototype AI with AI and see if (and how) we can take Wizard of Oz out of the equation.

Goals

• I want to see to what extent AI tools can generate dynamic results pages based on user input. Can the AI provide better results pages that look believable enough for a quick test?

• I want to compare the accuracy of the results, and how quickly I can build the prototype, against my original project.

Test Methodology

To start with, I decided to create something parallel to the AI product discovery project to better compare the two testing experiences.

I settled on streaming recommendations: an AI website that would recommend movies and shows available on streaming based on the user's preferences and location.

Since the purpose of the project is to test a testing methodology, I'm skipping research, exploration and manual design, and jumping straight to generated designs.

Claude

I started with Claude mostly because I wanted to see what it could do, and because the point of this experiment is quick and accurate prototyping, not the design specifically.

I prompted Claude to create my pages with two prompts:
"Can you create a mockups of a website that generates streaming recommendations based on your location, what you're subscribed to, and your movie tastes?"

"Can you make the profile setup screens, and add in sections where the website asks the user what streaming platform they have, where they are, what kind of shows they're looking for (and here they can choose specific shows they're on the lookout for, or choose shows similiar to ones they like or just be descriptive)"

I was surprised by how well it followed instructions. The design choices are not to my taste, and some of the features would not be required. I could ask for very specific changes to the design, but it would be a lot quicker to make the adjustments manually than to prompt and wait.

First page of the generated design of a streaming recommendation site

Second page of the generated design of a streaming recommendation site

Third page of the generated design of a streaming recommendation site

Fourth page of the generated design of a streaming recommendation site

Some of the features did not work: for example, I could not add BBC as another service/subscription.

There are some other things that I would definitely want to tinker with. If this were a real project, or if I wanted to start from the design phase, I'd want to connect this to Figma with a design component library. But we're not here for that. (I might play around with it later, or try designing it in Figma myself and using Figma MCP to generate the code in Claude Code, if I ever get around to getting the not-free Claude Code.)

Now the more complicated part.

"Can you adjust the results so that it is shows dynamic results based on what the user puts in the profile setup?"

Unfortunately, although Claude tried, it didn't work. Its reply sounded promising:

"Good — I can see the full file. The key change needed is replacing the static results screen with a dynamic rendering engine that reads the user's profile state and scores/filters the catalogue accordingly. I'll build a proper show catalogue with service tags, genre tags, and avoid flags, then compute match scores and generate personalized "why this?" reasoning at render time."
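To make that description concrete, here is a minimal sketch of what scoring and filtering a catalogue against a profile could look like. It is not the code Claude generated: the `Title` and `Profile` classes, their field names, the sample titles and the genre-overlap score are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Title:
    # Hypothetical catalogue entry; field names are assumptions.
    name: str
    services: dict[str, set[str]]  # country -> services carrying the title
    genres: set[str]
    flags: set[str] = field(default_factory=set)  # e.g. "gore", "jump scares"


@dataclass
class Profile:
    # Hypothetical snapshot of what the user entered during profile setup.
    country: str
    services: set[str]
    genres: set[str]
    avoid: set[str] = field(default_factory=set)


def recommend(profile, catalogue, limit=10):
    """Return (name, score, reason) tuples, best match first."""
    results = []
    for title in catalogue:
        # Hard filter: must be on a service the user has, in their country.
        available = title.services.get(profile.country, set()) & profile.services
        if not available:
            continue
        # Hard filter: skip anything carrying a flag the user wants to avoid.
        if title.flags & profile.avoid:
            continue
        # Soft score: number of genres shared with the user's tastes.
        shared = title.genres & profile.genres
        if not shared:
            continue
        reason = f"On {', '.join(sorted(available))}; matches {', '.join(sorted(shared))}"
        results.append((title.name, len(shared), reason))
    # Highest score first, then alphabetical for a stable order.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results[:limit]


# Made-up sample data, not real availability information.
catalogue = [
    Title("Example Heist Show", {"UK": {"Netflix"}}, {"crime", "thriller"}),
    Title("Example Slasher", {"UK": {"Netflix"}}, {"horror", "thriller"}, {"gore"}),
    Title("Example Sitcom", {"US": {"Hulu"}}, {"comedy"}),
]
me = Profile("UK", {"Netflix"}, {"thriller", "crime"}, {"gore"})

print(recommend(me, catalogue))
# [('Example Heist Show', 2, 'On Netflix; matches crime, thriller')]
```

In a sketch like this, a profile whose country, services or genres never line up with the catalogue's tags returns an empty list, which is the kind of "no matches" page a prototype would then show.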

The dynamic results page, showing no matches found — It was too easy too fast

My guess is that it either couldn't really build the catalogue or couldn't connect the catalogue to the input. So let's try building one with another AI.

ChatGPT

Let's ask ChatGPT to build a catalogue for us with the filters Claude created.

"Can you provide me with an excel sheet full of maybe 300 tv shows and movies, with information on what streaming they're available on in which locations, the genre, the kind of mood you'd watch them, the pacing, and whether they contain gore, jump scares, explicit content, violence, cliffhangers and child or animal peril? And can you include images for each title?"

The biggest snag, according to ChatGPT, was making the catalogue location-dependent, which would be a teeny bit too complicated to generate all at once. It asked me to pick some "important countries" to make it a little easier, added more filters, and used the IMDb database to build the table.

I had asked it to make a better-curated file so that the results could be more accurate. As a result, it couldn't hand me an Excel sheet with 300 titles all at once; it could only generate about 25 at a time. Not a problem, since this is just for a prototype (and a made-up one at that), so I asked it to generate several batches until I had 150 titles in an Excel sheet.
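If you end up stitching batches like these together yourself, a small script can do it. The sketch below assumes each batch has been exported as CSV text with a `title` column; the column names and sample rows are invented for illustration and are not the headers ChatGPT actually produced.

```python
import csv
import io


def merge_batches(batches):
    """Combine CSV batches into one list of rows, dropping repeated titles."""
    seen = set()
    merged = []
    for text in batches:
        for row in csv.DictReader(io.StringIO(text)):
            # Compare titles case-insensitively so "example sitcom" and
            # "Example Sitcom" count as the same entry.
            key = row["title"].strip().lower()
            if key in seen:
                continue  # a later batch repeated an earlier title
            seen.add(key)
            merged.append(row)
    return merged


# Made-up batches with hypothetical columns.
batch_1 = (
    "title,genre,services\n"
    "Example Heist Show,crime,UK:Netflix\n"
    "Example Sitcom,comedy,US:Hulu\n"
)
batch_2 = (
    "title,genre,services\n"
    "example sitcom,comedy,US:Hulu\n"
    "Example Documentary,documentary,UK:BBC iPlayer\n"
)

merged = merge_batches([batch_1, batch_2])
print([row["title"] for row in merged])
# ['Example Heist Show', 'Example Sitcom', 'Example Documentary']
```

Keeping the first occurrence of each title is an arbitrary choice here; for a throwaway prototype catalogue it is usually enough.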

Back to Claude (plus last-minute changes)

I gave Claude the Excel sheet and tested the results. The results were dynamic! And they were, for the most part, accurate to what I typed in!
I then asked for an inline, sticky edit drawer at the top of the results page, and here we are:

Screenshot of the final results page: it works!

You can see the working prototype here and check it out yourself!

I swallowed my urge to tinker with some small UI changes and told myself to try again later when I get Claude Code. I'd definitely feel more comfortable with an agentic system that starts from Figma, so that I have more control over the design, but once again, that's not the main goal of this experiment.

Conclusions

The question was: can agentic tools make a dynamic results page and a working prototype, and can it be done more quickly and easily than with the Wizard of Oz methodology?

Assuredly yes. For the AI project that I was working on, agentic tools can build a much better prototype with dynamic results a lot quicker, without the disadvantage of a nervous designer behind the scenes. The results were accurate too, and I bet they could be better if I spent more time on the catalogue in ChatGPT.

This leads me to ask: what other Wizard of Oz testing can be replaced with AI? I saw a wonderful presentation a few years ago at Axe-Con by Annabel Weiner and Courtney Benjamin, where they used Wizard of Oz to run a usability test for screen readers. Accessibility is (or maybe was) very hard to test for without developer intervention. If we can make it easier and quicker to test (make accessibility testing more accessible, if you will), it would be a joy to see. If I have a little more time, I would like to try that next!