
Disclaimer: This series documents patterns and code from building Thrifty Trip, my personal side project. All code examples, architectural decisions, and opinions are my own and are not related to my employer. Code is provided for educational purposes under the Apache 2.0 License.
In Part 2, we abstracted our prompts into separate modules for better version control and templating. We also introduced Zod for schema validation, guaranteeing valid JSON output and type safety.
This post introduces the pattern for multimodal input — sending images alongside text to get richer, more accurate analysis.
In the reselling and secondhand goods space, accurate representation of an item’s condition is critical. Browsing popular marketplaces will give you a crash course in the many ways an item can be misrepresented, such as:
- unworn;
- like new condition;
- fair condition even though it still has the original tags on it.
It is not all done out of malice. Many sellers are simply not aware of the impact their descriptions have on the value of their listings.
Gemini’s multimodal capabilities let us send images directly alongside our prompts. The model “sees” the image and extracts structured data—no separate computer vision pipeline required.
The pattern is identical to text generation, with one addition: base64-encoded image data in the request.
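To make the "one addition" concrete, here is a minimal sketch comparing the two request shapes. The `Part` types and the `buildParts` helper are illustrative names of my own, not SDK exports; only the `inlineData` and `text` field names come from the request shown later in this post.

```typescript
// Illustrative types; the SDK ships its own equivalents.
type TextPart = { text: string };
type ImagePart = { inlineData: { mimeType: string; data: string } };
type Part = TextPart | ImagePart;

interface EncodedImage {
  mimeType: string; // e.g. "image/jpeg"
  data: string; // base64-encoded image bytes
}

// A text-only request carries a single text part.
// A multimodal request puts an inlineData part in front of it.
function buildParts(prompt: string, image?: EncodedImage): Part[] {
  const parts: Part[] = [];
  if (image) {
    parts.push({ inlineData: { mimeType: image.mimeType, data: image.data } });
  }
  parts.push({ text: prompt });
  return parts;
}

const textOnly = buildParts("Describe a good resale photo.");
const withImage = buildParts("Analyze this product image.", {
  mimeType: "image/png",
  data: "iVBORw0KGgo=", // PNG signature bytes only, not a full image
});
```

Everything else in the call (model, config, response handling) stays the same between the two cases.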
As we did in Part 2, we extract our prompts and output schemas into separate modules for better version control and templating.
Figure 1: The multimodal flow. Image data joins the prompt as input.
In this example, we are ingesting an image and asking Gemini to evaluate for certain condition factors. Note, this is just a subset of evaluation features Thrifty Trip uses, but it is a good starting point for a resale listing.
From experience, even with structured output and Zod validation, Gemini can hallucinate and return invalid output, such as:
This makes adding explicit return instructions within the input prompt an essential step in our workflow. The input token cost is negligible compared to repairing broken pipelines and a subpar user experience.
/*
* Copyright 2025 Thrifty Trip LLC
* SPDX-License-Identifier: Apache-2.0
*/
/**
* Image Analysis Prompt for Resale Listings
* Analyzes product photos for quality, item details, and listing recommendations.
*
* Variables:
* - itemContext: Optional context about the item (title, category, etc.)
*/
export const IMAGE_ANALYSIS_PROMPT = `Analyze this product image for a resale listing. Return JSON with EXACTLY these fields:
{
"quality_score": <number 1-10>,
"lighting": "<excellent|good|fair|poor>",
"focus": "<sharp|acceptable|blurry>",
"background": "<clean|acceptable|distracting>",
"visible_brand": "<brand name or 'not visible'>",
"condition_issues": "<visible flaws or 'none visible'>",
"missing_shots": "<missing photo angles>",
"summary": "<2-3 sentence assessment>"
}
{{ itemContext }}`;
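The `{{ itemContext }}` placeholder is filled in by the `interpolate` helper from Part 2, which is not reproduced in this post. As a rough idea of what such a helper might look like, here is a hypothetical stand-in; the real implementation may differ, and leaving unknown placeholders untouched is my own assumption.

```typescript
// Hypothetical stand-in for the interpolate helper from Part 2.
// Replaces {{ name }} placeholders (inner whitespace optional) with values.
// Placeholders without a matching variable are left as-is.
function interpolate(
  template: string,
  variables: Record<string, string>
): string {
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (match: string, name: string) =>
      Object.prototype.hasOwnProperty.call(variables, name)
        ? variables[name]
        : match
  );
}

const filled = interpolate("Analyze this image.\n{{ itemContext }}", {
  itemContext: "ITEM CONTEXT:\nVintage denim jacket",
});
```

Passing an empty string for `itemContext` simply removes the placeholder, which is how the optional context can be omitted.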
We continue to use Zod to validate the model's output and to get type safety.
/*
* Copyright 2025 Thrifty Trip LLC
* SPDX-License-Identifier: Apache-2.0
*/
import { z } from "npm:zod";
/**
* Schema for image analysis results.
* Flattened to avoid nested object serialization issues with Gemini structured output.
*/
export const ImageAnalysisSchema = z.object({
// Photo quality
quality_score: z.number().describe("Overall photo quality score 1-10"),
lighting: z
.string()
.describe("Lighting quality: excellent, good, fair, or poor"),
focus: z.string().describe("Image focus: sharp, acceptable, or blurry"),
background: z
.string()
.describe("Background quality: clean, acceptable, or distracting"),
// Item details
visible_brand: z.string().describe("Brand name if visible, or 'not visible'"),
condition_issues: z
.string()
.describe("Visible flaws, stains, tears, or 'none visible'"),
// Recommendations
missing_shots: z.string().describe("Important photo angles that are missing"),
summary: z.string().describe("2-3 sentence assessment for resale listing"),
});
export type ImageAnalysisType = z.infer<typeof ImageAnalysisSchema>;
Key insight: Keep schemas flat. Nested objects can cause serialization issues with Gemini’s structured output mode.
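One consequence of typing `lighting`, `focus`, and `background` as plain strings is that a value outside the listed vocabulary still passes validation. A possible follow-up check is sketched below, assuming the allowed values are exactly the ones named in the prompt. `ALLOWED_VALUES` and `findOutOfVocabulary` are hypothetical names, not part of Thrifty Trip's code.

```typescript
// Allowed values, copied from the prompt's field descriptions.
const ALLOWED_VALUES = {
  lighting: ["excellent", "good", "fair", "poor"],
  focus: ["sharp", "acceptable", "blurry"],
  background: ["clean", "acceptable", "distracting"],
} as const;

type CheckedField = keyof typeof ALLOWED_VALUES;

// Returns the names of fields whose value is not in the allowed list.
// Comparison ignores case and surrounding whitespace.
function findOutOfVocabulary(
  analysis: Record<CheckedField, string>
): CheckedField[] {
  const fields = Object.keys(ALLOWED_VALUES) as CheckedField[];
  return fields.filter((field) => {
    const allowed: readonly string[] = ALLOWED_VALUES[field];
    return allowed.indexOf(analysis[field].trim().toLowerCase()) === -1;
  });
}

const problems = findOutOfVocabulary({
  lighting: "Good",
  focus: "slightly soft",
  background: "clean",
});
// problems contains only "focus"
```

Zod's `z.enum([...])` could enforce the same constraint inside the schema itself; whether that interacts well with structured output is something to test rather than assume.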
The first step of the analysis is to fetch the image and convert it to a base64 string. The inlineData part of the request carries the image bytes as base64 rather than as a URL, so this conversion is required.
Important definitions to understand:
MIME stands for “Multipurpose Internet Mail Extensions.” A MIME type is a standard way of describing the type of data in a file or stream (such as image/png or image/jpeg). Think of it as the media type.
/*
* Copyright 2025 Thrifty Trip LLC
* SPDX-License-Identifier: Apache-2.0
*/
// Fetch image and convert to base64
async function fetchImageAsBase64(
imageUrl: string
): Promise<{ data: string; mimeType: string }> {
console.log("[analyze-image] Fetching image from URL...");
const response = await fetch(imageUrl);
if (!response.ok) {
throw new Error(
`Failed to fetch image: ${response.status} ${response.statusText}`
);
}
const contentType = response.headers.get("content-type") || "image/jpeg";
const arrayBuffer = await response.arrayBuffer();
const uint8Array = new Uint8Array(arrayBuffer);
// Convert to base64 using Deno's built-in encoding
let binary = "";
for (let i = 0; i < uint8Array.length; i++) {
binary += String.fromCharCode(uint8Array[i]);
}
const base64Data = btoa(binary);
console.log(
"[analyze-image] Image fetched. Size:",
uint8Array.length,
"bytes, Type:",
contentType
);
return {
data: base64Data,
mimeType: contentType,
};
}
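A Content-Type header can carry parameters (for example `image/jpeg; charset=binary`) or name a type the model does not accept. Below is a minimal sketch of normalizing the header and checking it against an allow-list before calling the API. The helper names are hypothetical, and the list of supported types reflects my understanding of the Gemini documentation at the time of writing, so treat it as an assumption and verify it.

```typescript
// Image types assumed to be supported; check current Gemini documentation.
const SUPPORTED_IMAGE_TYPES = [
  "image/png",
  "image/jpeg",
  "image/webp",
  "image/heic",
  "image/heif",
];

// Reduce a Content-Type header to a bare, lowercase media type,
// e.g. "Image/JPEG; charset=binary" becomes "image/jpeg".
// A missing or empty header falls back to the given default.
function normalizeMimeType(
  header: string | null,
  fallback = "image/jpeg"
): string {
  if (!header) return fallback;
  const mediaType = header.split(";")[0].trim().toLowerCase();
  return mediaType || fallback;
}

// True when the media type is in the assumed allow-list.
function isSupportedImageType(mimeType: string): boolean {
  return SUPPORTED_IMAGE_TYPES.indexOf(mimeType) !== -1;
}
```

In `fetchImageAsBase64`, this would mean passing `response.headers.get("content-type")` through `normalizeMimeType` and throwing early when `isSupportedImageType` returns false, for instance when a URL returns an HTML error page instead of an image.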
Once we have the base64-encoded image, we call Gemini with both the image and our prompt. Note:
- VISION_MODEL is set via environment variable (gemini-2.5-flash by default; cost-effective, accurate).
- The image goes in inlineData, the prompt in text.
- contextText is the optional item context we are passing to the prompt. If there are missing attributes, we encode them as NULL.
/*
* Copyright 2025 Thrifty Trip LLC
* SPDX-License-Identifier: Apache-2.0
*/
import { GoogleGenAI } from "npm:@google/genai";
import { zodToJsonSchema } from "npm:zod-to-json-schema";
import { ImageAnalysisSchema } from "./schema.ts";
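import { interpolate, IMAGE_ANALYSIS_PROMPT } from "./prompts.ts";
// Client and model setup. The environment variable names are assumptions;
// use whatever names your deployment defines.
const ai = new GoogleGenAI({ apiKey: Deno.env.get("GEMINI_API_KEY") });
const VISION_MODEL = Deno.env.get("VISION_MODEL") ?? "gemini-2.5-flash";
// userId, image_url, and item_context come from the surrounding request handler.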
console.log("[analyze-image] Starting image analysis for user:", userId);
// Fetch and encode the image
const imageData = await fetchImageAsBase64(image_url);
// Build the prompt with optional item context
const contextText = item_context
? `ITEM CONTEXT:\n${
typeof item_context === "string"
? item_context
: JSON.stringify(item_context, null, 2)
}\n\n`
: "";
const prompt = interpolate(IMAGE_ANALYSIS_PROMPT, {
itemContext: contextText,
});
console.log("[analyze-image] Calling Gemini vision with model:", VISION_MODEL);
// Call Gemini with the image
const response = await ai.models.generateContent({
model: VISION_MODEL,
contents: [
{
role: "user",
parts: [
{
inlineData: {
mimeType: imageData.mimeType,
data: imageData.data,
},
},
{ text: prompt },
],
},
],
config: {
temperature: 0.3,
responseMimeType: "application/json",
responseJsonSchema: zodToJsonSchema(ImageAnalysisSchema),
},
});
// Parse and validate the response
const rawJsonText = response.text || "";
const analysis = ImageAnalysisSchema.parse(JSON.parse(rawJsonText));
// Now fully typed: analysis.quality_score, analysis.condition_issues, etc.
console.log(`Photo quality: ${analysis.quality_score}/10`);
console.log(`Issues found: ${analysis.condition_issues}`);
Figure 2: The complete flow—fetch image, call API, validate response.
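The parsing step above throws if the response text is empty or not valid JSON. A more defensive variant is sketched below. The specific failure mode it handles, JSON wrapped in a Markdown code fence, is an assumption of mine rather than something documented in this series, and `parseModelJson` is a hypothetical helper name.

```typescript
// Three backticks, built at runtime so this snippet stays easy to embed.
const FENCE = "`".repeat(3);

// Returns the parsed value, or null when the text is not valid JSON.
// Assumed failure mode: the JSON arrives wrapped in a Markdown code fence.
function parseModelJson(rawText: string): unknown {
  let body = rawText.trim();
  if (body.startsWith(FENCE) && body.endsWith(FENCE)) {
    // Drop the opening fence line (with or without a language tag)
    // and the closing fence.
    const firstNewline = body.indexOf("\n");
    body =
      firstNewline === -1
        ? ""
        : body.slice(firstNewline + 1, body.length - FENCE.length);
  }
  try {
    return JSON.parse(body);
  } catch {
    return null;
  }
}
```

The result can then go to `ImageAnalysisSchema.safeParse(...)`, which reports a validation failure in its return value instead of throwing, so the caller can retry or surface a friendly error.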
In Thrifty Trip, this pattern powers:
Sellers get actionable feedback before listing. Buyers get accurate representations. Everyone wins.
Pass the image as inlineData alongside text in your request.
We’ve now covered text generation, structured output, and multimodal input: the three core primitives for most AI features.
But what if you want to find similar items? Or let users search their inventory with natural language like “red vintage Nike jacket”?
In the next post, “Super Simple Steps: Embeddings & Semantic Search,” I’ll show you how to generate embeddings, store them in Supabase’s vector database, and build a similarity search that actually understands what users mean.
This series documents real patterns from building Thrifty Trip, a production inventory management app for fashion resellers. Code samples are available under the Apache 2.0 License.