If you’re reading this, you already know how fast AI is moving. It’s an exciting field to work in, but that pace presents its own set of challenges. How do you build a viable AI product when new products, companies, and features are launching virtually every day?
One person who is living this reality is my friend David Kossnick, AI Product Lead at Coda. Coda is a productivity suite that recently launched an AI work assistant. This assistant helps people manage, plan, and organize a variety of tasks–from planning a cross-country road trip to organizing an enterprise company’s customer feedback.
For the past year, David has been working with different AI models and toying with new tools to improve Coda’s tech. Recently, we caught up for a conversation about what it’s like to build an AI product right now and what tools Coda is using. I appreciated David’s frankness and suspect you will, too. Here it is, edited and condensed for clarity:
Tamar: What foundation model are you using for Coda AI? Did you evaluate multiple models?
David: We tried a ton of stuff, basically everything that’s out there. When we first got started, we only had access to GPT-3. We were using 50 different GPT-3 prompts for different tasks. We ended up building our own AI router layer that determined, based on a prompt, which GPT-3 template to send it to. It was a huge pain.
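To make that idea concrete, here is a minimal sketch of what a prompt-router layer could look like. The template names and the classify-then-fill approach are illustrative assumptions, not Coda’s actual implementation:

```python
# Minimal sketch of a prompt-router layer; the templates and routing approach
# are illustrative assumptions, not Coda's actual implementation.
from openai import OpenAI

client = OpenAI()

TEMPLATES = {
    "summarize": "Summarize the following text in three bullet points:\n\n{text}",
    "rewrite": "Rewrite the following text in a friendlier tone:\n\n{text}",
    "draft_reply": "Draft a short reply to the following message:\n\n{text}",
}

def route(user_input: str) -> str:
    """Ask the model which task-specific template a request should go to."""
    labels = ", ".join(TEMPLATES)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # we want a stable routing decision
        messages=[{
            "role": "user",
            "content": f"Classify this request as one of [{labels}]. "
                       f"Answer with the label only.\n\n{user_input}",
        }],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in TEMPLATES else "summarize"  # fall back to a default

def run(user_input: str) -> str:
    """Fill the chosen template and send it to the completion model."""
    prompt = TEMPLATES[route(user_input)].format(text=user_input)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

The pain David describes follows from this structure: every new task means another template, and the routing step itself is one more prompt that can misfire.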
When GPT-3.5 came out, we were super excited. At that point, we found that our fine-tuned GPT-3 models couldn’t really compete with GPT-3.5, which was kind of amazing. Even with a model fine-tuned on a more specific training set for a more specific task, it was very hard to compete with the quality of what the GPT-3.5 model produced.
When it comes to models, one of the companies and teams I'm impressed by is Amazon. We've been an early partner with Bedrock. Amazon Bedrock makes foundation models from third-party providers and Amazon available via an API, so you can choose between the models and run them on AWS. From a security perspective, it’s really nice to use a model that never leaves our cloud.
Tamar: Can you tell us what other tools you’re using in the tech stack for Coda AI?
David: The entire ecosystem has evolved so fast that tools meant to enhance AI often become redundant only a few months after they’re released. For example, for a while at Coda, we were using LangChain, which makes it easy to do common operations on top of AI.
LangChain was initially useful after OpenAI first released GPT-3.5, which had a pretty limited amount of context you could fit into a prompt. The workaround was to feed in small amounts of content at a time, chunk by chunk, and pass this chain of content along to generate a complete summary or answer. With newer models, context windows have gotten bigger, so there’s less need for this type of tool.
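As a rough illustration of that chunk-by-chunk pattern (a simplified version of what LangChain wraps, not Coda’s code), a “refine”-style summary over a long document might look like this:

```python
# Rough sketch of chunk-by-chunk summarization for small context windows;
# a simplified version of the pattern LangChain wraps, not Coda's code.
from openai import OpenAI

client = OpenAI()

def chunk(text: str, size: int = 3000) -> list[str]:
    """Naive fixed-size character chunking to stay under the context limit."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize(text: str) -> str:
    summary = ""
    for piece in chunk(text):
        prompt = (
            f"Summary so far:\n{summary or '(none yet)'}\n\n"
            f"Update the summary so it also covers this new excerpt:\n{piece}"
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        summary = resp.choices[0].message.content
    return summary
```

With a large context window, most of this machinery disappears: you can often pass the whole document in a single prompt.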
We're now using a tool called Braintrust to monitor prompt quality over time and to evaluate whether one prompt or model is better than another. It's making it easy to turn iteration and optimization into a science.
Tamar: How do you decide which answer is better when you're evaluating prompt feedback?
David: This is actually really hard to do, especially when it comes to the type of work we do at Coda. A lot of the uses for Coda are highly specific, and an answer that works well for one person in one scenario might not work as well in a slightly different scenario. For example, if you’re summarizing meeting notes to keep your team on track, you want a different output than if you’re summarizing customer call notes for product feedback.
When evaluating prompts, one of the biggest challenges is figuring out whether or not we’ve made real progress. Did we move two steps forward in one place, but three steps back somewhere else?
The best way to resolve this issue is by providing more examples and focusing on prioritization. The question we’ll ask ourselves is: “What are the key things we need for this to work well?”
Tamar: I’d love to hear more about your prompt evaluation process.
David: We’ve tried a lot of different approaches. At this point, I’ve probably chatted with about 20 or 30 prompt management tooling companies.
We do different evaluations on the prompts, one of which is logical checks. If you asked for bullets, did it give you bullets? We also have AIs evaluate quality and humans manually rate results. What’s surprising is that even really small changes can produce a different output. We’ve found cases where capitalizing a single word can generate a completely different result, even on current best-in-class models.
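For a sense of what those logical checks can look like, here is a small sketch with hypothetical check functions; the specific checks are assumptions for illustration, not Coda’s evaluation suite:

```python
# Hypothetical examples of "logical checks" on model output: cheap,
# deterministic assertions about the shape of the answer.
import re

def has_bullets(output: str) -> bool:
    """If the prompt asked for bullets, did the model actually return bullets?"""
    return any(re.match(r"\s*[-*\u2022]\s+\S", line) for line in output.splitlines())

def within_length(output: str, max_words: int = 150) -> bool:
    """If the prompt asked for a concise answer, is it actually concise?"""
    return len(output.split()) <= max_words

def run_checks(output: str) -> dict[str, bool]:
    """Run every check and report which ones pass, so results can be tracked over time."""
    return {"bullets": has_bullets(output), "concise": within_length(output)}
```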
Tamar: So you could run the same prompt multiple times and get a different answer?
David: There's a parameter called temperature, which determines how random a given model’s output is. A model running at a low temperature will give a deterministic, obvious answer. A model running at a higher temperature will deliver answers that are more creative and less deterministic.
This means that, to some extent, you can control how deterministic your answer will be. There are many cases where you’ll want the product to provide an output that’s higher temperature–where you’ll actually want it to give you a different answer each time.
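As a concrete sketch of how that control works (using OpenAI’s chat API; the prompt here is just an example), temperature is set per request:

```python
# Sketch: the same prompt at two temperatures via OpenAI's chat API.
# Lower temperature -> more repeatable output; higher -> more varied output.
from openai import OpenAI

client = OpenAI()
prompt = [{"role": "user", "content": "Suggest a name for our team wiki."}]

stable = client.chat.completions.create(
    model="gpt-3.5-turbo", messages=prompt, temperature=0.0,
)
creative = client.chat.completions.create(
    model="gpt-3.5-turbo", messages=prompt, temperature=1.2,
)
print(stable.choices[0].message.content)
print(creative.choices[0].message.content)
```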
Tamar: What kind of requirements are you getting from enterprises for privacy and security?
David: It's a moving landscape. The conversations we're having right now are very different from the conversations we were having just six months ago. I’m pretty sure that, right now, every company is in the process of writing up their internal AI policy documents.
It’s funny because at Coda, we just drafted our first AI usage terms–and we’re an AI company. We told employees: you’re not allowed to use ChatGPT through the UI. If you use OpenAI’s UI, they can train on your data. This means that if you ask ChatGPT’s UI what you should do differently in your strategy memo for the next year of projects, OpenAI now knows your next year of projects, can incorporate that into their next model, and someone else could find out about it.
But at the API layer, OpenAI doesn’t do that. For paying customers like ourselves, OpenAI is prohibited from training on any data we send them through the API–which is really, really important to us, especially since we work with tens of thousands of enterprise teams.
Security is a big challenge right now. I will say the guarantee of non-training has made a huge difference to enterprise customers. Far and away, enterprise companies' number one concern is any kind of data leakage. The Samsung case earlier this year was a big warning sign to many people about that.
Tamar: How are you thinking about pricing for AI features?
David: For low-volume usage, the cost is not that bad. If you ask a program to help you finish a paragraph or change your tone, the volume per user doesn't really scale up.
With Coda we have AI columns in tables. We also have tables that are integrations. This means that you can pull in 100,000 rows from Salesforce of all your accounts or 100,000 GitHub pull requests or JIRA tickets or emails with a click of a button.
In these instances, you can use AI columns to draft replies to all of your emails or follow-ups to all the events you attend, and these messages can be customized based on the data you’ve brought in. This is where AI can become an automation workhorse. But the challenge with this type of output is that, with a single click, you can run 100,000 executions, each of which probably demands a huge amount of context. For that scale, we are thinking about ways to cover AI costs. Our goal is to make money off Coda, not off AI, and to make AI as accessible as possible for as many people as possible.
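To see why that scale matters, here is a back-of-the-envelope sketch; the token counts and price are hypothetical assumptions, not Coda’s actual usage or costs:

```python
# Back-of-the-envelope cost arithmetic with hypothetical numbers,
# not Coda's actual usage or pricing.
rows = 100_000                 # one click over a 100,000-row table
tokens_per_call = 2_000        # assumed context + output tokens per row
price_per_1k_tokens = 0.002    # assumed price in dollars; varies by model

total_cost = rows * tokens_per_call / 1_000 * price_per_1k_tokens
print(f"${total_cost:,.0f}")   # -> $400 for a single click under these assumptions
```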
Tamar: Is there a willingness to pay for this type of automation?
David: There are cases where it's saving people 20 hours a week, freeing them up to work on bigger goals or harder problems. In other cases, people are using AI to explore items that might initially seem like “nice to haves”, but they’re realizing these use cases provide significant value. For example, having a clear view of outcomes from every meeting.
Given how much value some of these use cases generate, it helps to think about the cost of AI tools in terms of paying down technical debt or process debt. AI can help solve a problem or explore use cases that otherwise wouldn’t happen.
Tamar: From these examples, it’s undeniable that AI will have a transformative impact on productivity. How do you see AI continuing to progress?
David: There's a lot of hype right now on prompts and language being the ultimate interface. Personally, I’m quite skeptical about this, simply because language is really, really hard. In many cases, UI is actually easier and better.
We built Coda AI first as a set of building blocks where you could enter a prompt. Over time, we've been adding more layers to the interface. We call these layers “one-clicks” where you click “summarize” and it gives you a summary. It skips the prompt box entirely. It’s no surprise that people love them. While natural language is compelling, I think we're going to see a step away from it. Over the next few years, I believe we’ll move towards uses that are more familiar and precise–not to mention faster.
Right now, we think about Coda AI as a co-pilot, not an autopilot. In many cases, the quality still has flaws and you want to be sure you review and tweak the output. We built all of our UI for this. Anything AI gives back is editable. It's not static. As quality improves over the next few years, people will develop more trust. That’s when we’ll start trusting AI to act as an autopilot, rather than a co-pilot.
Tamar: How is Coda using AI internally to help with your own productivity?
David: We use Coda everywhere. As a product manager, I use AI in every meeting to extract action items and next steps. Coda itself categorizes all the customer feedback that comes in about Coda, pulling out and summarizing themes. I use it to help draft our press releases and blog posts. I’m even using it to reply to my emails.
Tamar: I’ll remember that next time you send me an email. Are there any gaps where you wish there was a tool that performed a certain task?
David: There are a ton of gaps and a ton of pain points that a lot of people are aware of and working on, but they're not closed yet. Everything around prompt evaluation, model evaluation, and fine-tuning is very hard right now.
Part of what makes it so difficult is that Turbo, OpenAI’s GPT-3.5 model, is not fine-tunable. But it's still much better than anything else on the market, so you have to use it, which means you end up in this dance where your main configurable layer is prompts. And prompts are brittle, fragile, and non-deterministic.
Right now, there's a lot of energy being spent on the prompt management and optimization layer, which is really, really tough. It'll be very interesting to see what happens as soon as we have Turbo-like models that are fine-tunable.
Tamar: There's a lot of hype around AI today. Is there anything that you feel is overhyped?
David: I’ll state my bias directly: I’m extremely hyped on AI. So I’m probably the wrong person to ask hah. I started my career at Google and I’ve always been an AI head. I'm probably more excited now than I have ever been in terms of promise and opportunity. What I’m most excited about is what the future of work will look like, especially for knowledge workers.
I have a lot of friends, especially PMs who are scared right now. They’re asking, “What's going to happen to my job?” But I believe that we’re on the cusp of an incredible opportunity.
There's that idea about 10x engineers. 10x engineers are rapidly becoming 100x engineers: how productive can we be with a co-pilot and co-generation? In the future, the 100x PMs will manage more projects in parallel and then go deeper on the hardest one, and get to an insight faster with an AI's help. The 100x designers will use Figma AI-plus to generate a huge array of mock-ups and then dig into the key ones.
The other thing we'll probably see is more hybrid roles: people who aren't necessarily engineers but will now be able to write initial drafts of code, and work with teams to refine and ship. People who are new to becoming a PM can now manage an array of projects. People who aren't marketers will be able to articulate and share their vision, enabling marketing teams to rethink ideas and strategies. Small teams will move even faster than they have historically. That’s a super interesting future world.
Thanks for having me on, Tamar!