
Super Simple Steps: Multimodal Input (Part 3)

Sean Miller
#blog #series #instructions #tutorials #generative ai #multimodal input


Disclaimer: This series documents patterns and code from building Thrifty Trip, my personal side project. All code examples, architectural decisions, and opinions are my own and are not related to my employer. Code is provided for educational purposes under the Apache 2.0 License.

In Part 2, we abstracted our prompts into separate modules for better version control and templating. We also introduced Zod for schema validation, guaranteeing valid JSON output and type safety.

This post introduces the pattern for multimodal input — sending images alongside text to get richer, more accurate analysis.

The Image Understanding Challenge

In the reselling and secondhand goods space, accurate representation of an item’s condition is critical. Browsing popular marketplaces will give you a crash course in the many ways an item can be misrepresented.

It is not all done out of malice. Many sellers are simply not aware of the impact their descriptions have on the value of their listings.

The Solution: Multimodal Input

Gemini’s multimodal capabilities let us send images directly alongside our prompts. The model “sees” the image and extracts structured data—no separate computer vision pipeline required.

The pattern is identical to text generation, with one addition: base64-encoded image data in the request.
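As a quick preview, the request body gains exactly one extra part. The field names below follow the @google/genai SDK; the values are placeholders, not real data:

```typescript
// Sketch: a multimodal request differs from text-only generation
// by a single extra inlineData part alongside the text prompt.
const parts = [
  {
    inlineData: {
      mimeType: "image/jpeg", // actual type comes from the fetched image
      data: "<base64-encoded image bytes>", // placeholder
    },
  },
  { text: "Analyze this product image..." },
];
```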

As we did in Part 2, we extract our prompts and output schemas into separate modules for better version control and templating.

```mermaid
graph LR
    A[prompts.ts] -->|Template| B[index.ts]
    C[schema.ts] -->|Types| B
    D[Image URL] -->|Fetch & Encode| B
    B -->|Image + Prompt| E[Gemini API]
    E -->|JSON Response| B
    B -->|Validated Output| F[(Database)]
    style A fill:#34A853,stroke:#333,stroke-width:2px,color:white
    style C fill:#FBBC04,stroke:#333,stroke-width:2px
    style B fill:#4285F4,stroke:#333,stroke-width:2px,color:white
    style D fill:#E8F0FE,stroke:#4285F4,stroke-width:2px
    style E fill:#EA4335,stroke:#333,stroke-width:2px,color:white
    style F fill:#EA4335,stroke:#333,stroke-width:2px,color:white
```

Figure 1: The multimodal flow. Image data joins the prompt as input.

Prompt

In this example, we ingest an image and ask Gemini to evaluate it for specific condition factors. This is just a subset of the evaluation features Thrifty Trip uses, but it is a good starting point for a resale listing.

From experience, even with structured output and Zod validation, Gemini can hallucinate and return invalid output.

This makes adding explicit return instructions within the input prompt an essential step in our workflow. The input token cost is negligible compared to repairing broken pipelines and a subpar user experience.
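For example, the enum-like fields in the schema below (lighting, focus, background) are validated as plain strings, so an out-of-vocabulary value would still pass Zod. A cheap allow-list check, a hypothetical helper rather than actual Thrifty Trip code, can catch that:

```typescript
// Hypothetical sanity check for enum-like string fields: the Zod
// schema accepts any string, so verify the value is one the prompt
// actually asked for before trusting it downstream.
const ALLOWED_LIGHTING = new Set(["excellent", "good", "fair", "poor"]);

export function isValidLighting(value: string): boolean {
  return ALLOWED_LIGHTING.has(value.trim().toLowerCase());
}
```

The same pattern extends to focus and background; an invalid value can trigger a retry instead of a broken pipeline.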

/*
 * Copyright 2025 Thrifty Trip LLC
 * SPDX-License-Identifier: Apache-2.0
 */

/**
 * Image Analysis Prompt for Resale Listings
 * Analyzes product photos for quality, item details, and listing recommendations.
 *
 * Variables:
 * - itemContext: Optional context about the item (title, category, etc.)
 */
export const IMAGE_ANALYSIS_PROMPT = `Analyze this product image for a resale listing. Return JSON with EXACTLY these fields:

{
  "quality_score": <number 1-10>,
  "lighting": "<excellent|good|fair|poor>",
  "focus": "<sharp|acceptable|blurry>",
  "background": "<clean|acceptable|distracting>",
  "visible_brand": "<brand name or 'not visible'>",
  "condition_issues": "<visible flaws or 'none visible'>",
  "missing_shots": "<missing photo angles>",
  "summary": "<2-3 sentence assessment>"
}

{{ itemContext }}`;
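The `{{ itemContext }}` placeholder is filled at call time by the `interpolate` helper imported in the implementation. A minimal version might look like this (a sketch of its assumed shape; the real helper may differ):

```typescript
// Hypothetical minimal interpolate helper: replaces {{ name }}
// placeholders with values, leaving unknown placeholders intact
// so a missing variable is easy to spot in logs.
export function interpolate(
  template: string,
  variables: Record<string, string>
): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    name in variables ? variables[name] : match
  );
}
```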

Schema

We continue to use Zod for output validation and type safety.

/*
 * Copyright 2025 Thrifty Trip LLC
 * SPDX-License-Identifier: Apache-2.0
 */

import { z } from "npm:zod";

/**
 * Schema for image analysis results.
 * Flattened to avoid nested object serialization issues with Gemini structured output.
 */
export const ImageAnalysisSchema = z.object({
  // Photo quality
  quality_score: z.number().describe("Overall photo quality score 1-10"),
  lighting: z
    .string()
    .describe("Lighting quality: excellent, good, fair, or poor"),
  focus: z.string().describe("Image focus: sharp, acceptable, or blurry"),
  background: z
    .string()
    .describe("Background quality: clean, acceptable, or distracting"),

  // Item details
  visible_brand: z.string().describe("Brand name if visible, or 'not visible'"),
  condition_issues: z
    .string()
    .describe("Visible flaws, stains, tears, or 'none visible'"),

  // Recommendations
  missing_shots: z.string().describe("Important photo angles that are missing"),
  summary: z.string().describe("2-3 sentence assessment for resale listing"),
});

export type ImageAnalysisType = z.infer<typeof ImageAnalysisSchema>;

Key insight: Keep schemas flat. Nested objects can cause serialization issues with Gemini’s structured output mode.

Implementation

Step 1: Fetch and Encode the Image

The first step of the analysis is to fetch the image and transform it into a base64 string. This is required because the Gemini API’s inline data input accepts raw image bytes, not a URL.

/*
 * Copyright 2025 Thrifty Trip LLC
 * SPDX-License-Identifier: Apache-2.0
 */

// Fetch image and convert to base64
async function fetchImageAsBase64(
  imageUrl: string
): Promise<{ data: string; mimeType: string }> {
  console.log("[analyze-image] Fetching image from URL...");

  const response = await fetch(imageUrl);
  if (!response.ok) {
    throw new Error(
      `Failed to fetch image: ${response.status} ${response.statusText}`
    );
  }

  const contentType = response.headers.get("content-type") || "image/jpeg";
  const arrayBuffer = await response.arrayBuffer();
  const uint8Array = new Uint8Array(arrayBuffer);

  // Build a binary string from the bytes, then base64-encode with btoa
  let binary = "";
  for (let i = 0; i < uint8Array.length; i++) {
    binary += String.fromCharCode(uint8Array[i]);
  }
  const base64Data = btoa(binary);

  console.log(
    "[analyze-image] Image fetched. Size:",
    uint8Array.length,
    "bytes, Type:",
    contentType
  );

  return {
    data: base64Data,
    mimeType: contentType,
  };
}
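The character-by-character loop above is correct but slow for large images, since it grows a string one byte at a time. A chunked variant of the same btoa approach (a sketch, not the production code) performs much better:

```typescript
// Sketch: encode in 32 KB chunks so String.fromCharCode receives a
// bounded argument list and string concatenation stays near-linear.
function bytesToBase64(bytes: Uint8Array): string {
  const CHUNK = 0x8000; // 32 KB, safely under the engine's argument limit
  const parts: string[] = [];
  for (let i = 0; i < bytes.length; i += CHUNK) {
    parts.push(String.fromCharCode(...bytes.subarray(i, i + CHUNK)));
  }
  return btoa(parts.join(""));
}
```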

Step 2: Call Gemini with the Image

Once we have the base64-encoded image, we call Gemini with both the image and our prompt.

/*
 * Copyright 2025 Thrifty Trip LLC
 * SPDX-License-Identifier: Apache-2.0
 */

import { GoogleGenAI } from "npm:@google/genai";
import { zodToJsonSchema } from "npm:zod-to-json-schema";
import { ImageAnalysisSchema } from "./schema.ts";
import { interpolate, IMAGE_ANALYSIS_PROMPT } from "./prompts.ts";

// `ai` (the GoogleGenAI client), VISION_MODEL, userId, image_url, and
// item_context are initialized earlier in this edge function
console.log("[analyze-image] Starting image analysis for user:", userId);

// Fetch and encode the image
const imageData = await fetchImageAsBase64(image_url);

// Build the prompt with optional item context
const contextText = item_context
  ? `ITEM CONTEXT:\n${
      typeof item_context === "string"
        ? item_context
        : JSON.stringify(item_context, null, 2)
    }\n\n`
  : "";

const prompt = interpolate(IMAGE_ANALYSIS_PROMPT, {
  itemContext: contextText,
});

console.log("[analyze-image] Calling Gemini vision with model:", VISION_MODEL);

// Call Gemini with the image
const response = await ai.models.generateContent({
  model: VISION_MODEL,
  contents: [
    {
      role: "user",
      parts: [
        {
          inlineData: {
            mimeType: imageData.mimeType,
            data: imageData.data,
          },
        },
        { text: prompt },
      ],
    },
  ],
  config: {
    temperature: 0.3,
    responseMimeType: "application/json",
    responseJsonSchema: zodToJsonSchema(ImageAnalysisSchema),
  },
});

// Parse and validate the response
const rawJsonText = response.text || "";
const analysis = ImageAnalysisSchema.parse(JSON.parse(rawJsonText));

// Now fully typed: analysis.quality_score, analysis.condition_issues, etc.
console.log(`Photo quality: ${analysis.quality_score}/10`);
console.log(`Issues found: ${analysis.condition_issues}`);

Figure 2: The complete flow—fetch image, call API, validate response.

Results in Production

In Thrifty Trip, this pattern powers photo analysis in production. Sellers get actionable feedback before listing. Buyers get accurate representations. Everyone wins.

Key Takeaways

  1. Multimodal is just another input type. Add inlineData alongside text in your request.
  2. Base64 encoding is required. Gemini doesn’t fetch URLs—you must send the image data.
  3. Same structured output pattern. Zod schemas work identically for vision responses.
  4. Keep schemas flat. Avoid nested objects to prevent serialization issues.
  5. Prompt redundancy helps. Explicit field instructions reduce hallucinations even with schemas.

What’s Next

We’ve now covered text generation, structured output, and multimodal input—the three core primitives for most AI features.

But what if you want to find similar items? Or let users search their inventory with natural language like “red vintage Nike jacket”?

In the next post, “Super Simple Steps: Embeddings & Semantic Search,” I’ll show you how to generate embeddings, store them in Supabase’s vector database, and build a similarity search that actually understands what users mean.


Series Roadmap

  1. Generative AI — The basic primitive
  2. Structured Output — Prompt version control & Zod schemas
  3. Multimodal Input (this post) — Processing images with AI
  4. Embeddings & Semantic Search — Finding similar items
  5. Grounding & Search — Connecting AI to real-time data
  6. The Batch API — Processing thousands of items efficiently
  7. Building an AI Agent — Giving AI tools to solve problems
  8. Evaluating Success — Testing and measuring quality

This series documents real patterns from building Thrifty Trip, a production inventory management app for fashion resellers. Code samples are available under the Apache 2.0 License.
