
Super Simple Steps: Multimodal Input (Part 3)

Sean Miller
#blog #series #instructions #tutorials #generative ai #multimodal input


Disclaimer: This series documents patterns and code from building Thrifty Trip, my personal side project. All code examples, architectural decisions, and opinions are my own and are not related to my employer. Code is provided for educational purposes under the Apache 2.0 License.

In Part 2, we abstracted our prompts into separate modules for better version control and templating. We also introduced Zod for schema validation, guaranteeing valid JSON output and type safety.

This post introduces the pattern for multimodal input — sending images alongside text to get richer, more accurate analysis.

The Image Understanding Challenge

In the reselling and secondhand goods space, accurate representation of an item’s condition is critical. Browsing popular marketplaces will give you a crash course in the many ways an item can be misrepresented.

It is not all done out of malice. Many sellers are simply not aware of the impact their descriptions have on the value of their listings.

The Solution: Multimodal Input

Gemini’s multimodal capabilities let us send images directly alongside our prompts. The model “sees” the image and extracts structured data—no separate computer vision pipeline required.

The pattern is identical to text generation, with one addition: base64-encoded image data in the request.
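As a quick preview, the request body gains exactly one extra part. The field names below follow the @google/genai SDK; the values are placeholders, not real data:

```typescript
// Sketch: a multimodal request differs from text-only generation
// by a single extra inlineData part alongside the text prompt.
const parts = [
  {
    inlineData: {
      mimeType: "image/jpeg", // actual type comes from the fetched image
      data: "<base64-encoded image bytes>", // placeholder
    },
  },
  { text: "Analyze this product image..." },
];
```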

As we did in Part 2, we extract our prompts and output schemas into separate modules for better version control and templating.

```mermaid
graph LR
    A[prompts.ts] -->|Template| B[index.ts]
    C[schema.ts] -->|Types| B
    D[Image URL] -->|Fetch & Encode| B
    B -->|Image + Prompt| E[Gemini API]
    E -->|JSON Response| B
    B -->|Validated Output| F[(Database)]
    style A fill:#34A853,stroke:#333,stroke-width:2px,color:white
    style C fill:#FBBC04,stroke:#333,stroke-width:2px
    style B fill:#4285F4,stroke:#333,stroke-width:2px,color:white
    style D fill:#E8F0FE,stroke:#4285F4,stroke-width:2px
    style E fill:#EA4335,stroke:#333,stroke-width:2px,color:white
    style F fill:#EA4335,stroke:#333,stroke-width:2px,color:white
```

Figure 1: The multimodal flow. Image data joins the prompt as input.

Prompt

In this example, we ingest an image and ask Gemini to evaluate it for specific condition factors. This is just a subset of the evaluation features Thrifty Trip uses, but it is a good starting point for a resale listing.

From experience, even with structured output and Zod validation, Gemini can hallucinate and return invalid output.

This makes adding explicit return instructions within the input prompt an essential step in our workflow. The input token cost is negligible compared to repairing broken pipelines and a subpar user experience.
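For example, the enum-like fields in the schema below (lighting, focus, background) are validated as plain strings, so an out-of-vocabulary value would still pass Zod. A cheap allow-list check, a hypothetical helper rather than actual Thrifty Trip code, can catch that:

```typescript
// Hypothetical sanity check for enum-like string fields: the Zod
// schema accepts any string, so verify the value is one the prompt
// actually asked for before trusting it downstream.
const ALLOWED_LIGHTING = new Set(["excellent", "good", "fair", "poor"]);

export function isValidLighting(value: string): boolean {
  return ALLOWED_LIGHTING.has(value.trim().toLowerCase());
}
```

The same pattern extends to focus and background; an invalid value can trigger a retry instead of a broken pipeline.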

/*
 * Copyright 2025 Thrifty Trip LLC
 * SPDX-License-Identifier: Apache-2.0
 */

/**
 * Image Analysis Prompt for Resale Listings
 * Analyzes product photos for quality, item details, and listing recommendations.
 *
 * Variables:
 * - itemContext: Optional context about the item (title, category, etc.)
 */
export const IMAGE_ANALYSIS_PROMPT = `Analyze this product image for a resale listing. Return JSON with EXACTLY these fields:

{
  "quality_score": <number 1-10>,
  "lighting": "<excellent|good|fair|poor>",
  "focus": "<sharp|acceptable|blurry>",
  "background": "<clean|acceptable|distracting>",
  "visible_brand": "<brand name or 'not visible'>",
  "condition_issues": "<visible flaws or 'none visible'>",
  "missing_shots": "<missing photo angles>",
  "summary": "<2-3 sentence assessment>"
}

{{ itemContext }}`;
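The `{{ itemContext }}` placeholder is filled at call time by the `interpolate` helper imported in the implementation. A minimal version might look like this (a sketch of its assumed shape; the real helper may differ):

```typescript
// Hypothetical minimal interpolate helper: replaces {{ name }}
// placeholders with values, leaving unknown placeholders intact
// so a missing variable is easy to spot in logs.
export function interpolate(
  template: string,
  variables: Record<string, string>
): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    name in variables ? variables[name] : match
  );
}
```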

Schema

We continue to use Zod for output validation and type safety.

/*
 * Copyright 2025 Thrifty Trip LLC
 * SPDX-License-Identifier: Apache-2.0
 */

import { z } from "npm:zod";

/**
 * Schema for image analysis results.
 * Flattened to avoid nested object serialization issues with Gemini structured output.
 */
export const ImageAnalysisSchema = z.object({
  // Photo quality
  quality_score: z.number().describe("Overall photo quality score 1-10"),
  lighting: z
    .string()
    .describe("Lighting quality: excellent, good, fair, or poor"),
  focus: z.string().describe("Image focus: sharp, acceptable, or blurry"),
  background: z
    .string()
    .describe("Background quality: clean, acceptable, or distracting"),

  // Item details
  visible_brand: z.string().describe("Brand name if visible, or 'not visible'"),
  condition_issues: z
    .string()
    .describe("Visible flaws, stains, tears, or 'none visible'"),

  // Recommendations
  missing_shots: z.string().describe("Important photo angles that are missing"),
  summary: z.string().describe("2-3 sentence assessment for resale listing"),
});

export type ImageAnalysisType = z.infer<typeof ImageAnalysisSchema>;

Key insight: Keep schemas flat. Nested objects can cause serialization issues with Gemini’s structured output mode.

Implementation

Step 1: Fetch and Encode the Image

The first step of the analysis is to fetch the image and transform it into a base64 string. This is required because the Gemini API’s inline data input accepts raw image bytes, not a URL.

/*
 * Copyright 2025 Thrifty Trip LLC
 * SPDX-License-Identifier: Apache-2.0
 */

// Fetch image and convert to base64
async function fetchImageAsBase64(
  imageUrl: string
): Promise<{ data: string; mimeType: string }> {
  console.log("[analyze-image] Fetching image from URL...");

  const response = await fetch(imageUrl);
  if (!response.ok) {
    throw new Error(
      `Failed to fetch image: ${response.status} ${response.statusText}`
    );
  }

  const contentType = response.headers.get("content-type") || "image/jpeg";
  const arrayBuffer = await response.arrayBuffer();
  const uint8Array = new Uint8Array(arrayBuffer);

  // Build a binary string from the bytes, then base64-encode with btoa
  let binary = "";
  for (let i = 0; i < uint8Array.length; i++) {
    binary += String.fromCharCode(uint8Array[i]);
  }
  const base64Data = btoa(binary);

  console.log(
    "[analyze-image] Image fetched. Size:",
    uint8Array.length,
    "bytes, Type:",
    contentType
  );

  return {
    data: base64Data,
    mimeType: contentType,
  };
}
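The character-by-character loop above is correct but slow for large images, since it grows a string one byte at a time. A chunked variant of the same btoa approach (a sketch, not the production code) performs much better:

```typescript
// Sketch: encode in 32 KB chunks so String.fromCharCode receives a
// bounded argument list and string concatenation stays near-linear.
function bytesToBase64(bytes: Uint8Array): string {
  const CHUNK = 0x8000; // 32 KB, safely under the engine's argument limit
  const parts: string[] = [];
  for (let i = 0; i < bytes.length; i += CHUNK) {
    parts.push(String.fromCharCode(...bytes.subarray(i, i + CHUNK)));
  }
  return btoa(parts.join(""));
}
```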

Step 2: Call Gemini with the Image

Once we have the base64-encoded image, we call Gemini with both the image and our prompt.

/*
 * Copyright 2025 Thrifty Trip LLC
 * SPDX-License-Identifier: Apache-2.0
 */

import { GoogleGenAI } from "npm:@google/genai";
import { zodToJsonSchema } from "npm:zod-to-json-schema";
import { ImageAnalysisSchema } from "./schema.ts";
import { interpolate, IMAGE_ANALYSIS_PROMPT } from "./prompts.ts";

// `ai` (the GoogleGenAI client), VISION_MODEL, userId, image_url, and
// item_context are initialized earlier in this edge function
console.log("[analyze-image] Starting image analysis for user:", userId);

// Fetch and encode the image
const imageData = await fetchImageAsBase64(image_url);

// Build the prompt with optional item context
const contextText = item_context
  ? `ITEM CONTEXT:\n${
      typeof item_context === "string"
        ? item_context
        : JSON.stringify(item_context, null, 2)
    }\n\n`
  : "";

const prompt = interpolate(IMAGE_ANALYSIS_PROMPT, {
  itemContext: contextText,
});

console.log("[analyze-image] Calling Gemini vision with model:", VISION_MODEL);

// Call Gemini with the image
const response = await ai.models.generateContent({
  model: VISION_MODEL,
  contents: [
    {
      role: "user",
      parts: [
        {
          inlineData: {
            mimeType: imageData.mimeType,
            data: imageData.data,
          },
        },
        { text: prompt },
      ],
    },
  ],
  config: {
    temperature: 0.3,
    responseMimeType: "application/json",
    responseJsonSchema: zodToJsonSchema(ImageAnalysisSchema),
  },
});

// Parse and validate the response
const rawJsonText = response.text || "";
const analysis = ImageAnalysisSchema.parse(JSON.parse(rawJsonText));

// Now fully typed: analysis.quality_score, analysis.condition_issues, etc.
console.log(`Photo quality: ${analysis.quality_score}/10`);
console.log(`Issues found: ${analysis.condition_issues}`);

Figure 2: The complete flow—fetch image, call API, validate response.

Results in Production

In Thrifty Trip, this pattern powers photo analysis in production. Sellers get actionable feedback before listing. Buyers get accurate representations. Everyone wins.

Key Takeaways

  1. Multimodal is just another input type. Add inlineData alongside text in your request.
  2. Base64 encoding is required. Gemini doesn’t fetch URLs—you must send the image data.
  3. Same structured output pattern. Zod schemas work identically for vision responses.
  4. Keep schemas flat. Avoid nested objects to prevent serialization issues.
  5. Prompt redundancy helps. Explicit field instructions reduce hallucinations even with schemas.

What’s Next

We’ve now covered text generation, structured output, and multimodal input—the three core primitives for most AI features.

But what if you want to find similar items? Or let users search their inventory with natural language like “red vintage Nike jacket”?

In the next post, “Super Simple Steps: Embeddings & Semantic Search,” I’ll show you how to generate embeddings, store them in Supabase’s vector database, and build a similarity search that actually understands what users mean.


Series Roadmap

  1. Generative AI — The basic primitive
  2. Structured Output — Prompt version control & Zod schemas
  3. Multimodal Input (this post) — Processing images with AI
  4. Embeddings & Semantic Search — Finding similar items
  5. Grounding & Search — Connecting AI to real-time data
  6. The Batch API — Processing thousands of items efficiently
  7. Building an AI Agent — Giving AI tools to solve problems
  8. Evaluating Success — Testing and measuring quality

This series documents real patterns from building Thrifty Trip, a production inventory management app for fashion resellers. Code samples are available under the Apache 2.0 License.
