Gemini 3.5 Flash API Tutorial: Authentication, Rate Limits, and Example Requests

This Gemini 3.5 Flash API tutorial is a practical guide for developers who want fast, cost-efficient text generation and chat experiences through Google's Gemini API. Flash models are designed for low latency and high throughput, making them a strong fit for chatbots, interactive tools, and high-volume content workflows. Gemini 3.5 Flash follows the same REST and SDK mechanics as earlier Gemini Flash generations, so you can reuse proven patterns for authentication, quota management, and request structure.
What is Gemini 3.5 Flash, and How Do You Access It?
Gemini is a family of models that has progressed from 1.0 and 1.5 through 2.0 and into the 3.x generation. Within each generation, Flash variants prioritize speed and throughput, typically at a lower cost than more capable tiers like Pro or Ultra. Google exposes these models through a unified set of endpoints where the primary difference is the model ID you specify, such as gemini-1.5-flash, gemini-2.0-flash, and gemini-3.5-flash.

You can access Gemini 3.5 Flash in two main ways:
- Gemini API (AI Studio API key): best for individual developers, prototypes, and quick integrations.
- Vertex AI (Google Cloud credentials): best for enterprise workloads requiring IAM, centralized governance, and quota management.
Both options support client libraries and raw REST calls. The canonical REST pattern uses endpoints like /v1/models/{model}:generateContent, with /v1/models/{model}:streamGenerateContent as the corresponding streaming endpoint.
Authentication for the Gemini 3.5 Flash API
Your authentication choice depends on whether you are building a prototype or running production workloads with governance requirements. A common progression is to start with an AI Studio key, then migrate to Vertex AI as the application matures.
Option 1: AI Studio API Key (Gemini API)
To use the Gemini API, create an API key in Google AI Studio and store it securely. For local development, set it as an environment variable. Google recognizes these variable names:
- GEMINI_API_KEY
- GOOGLE_API_KEY (takes precedence if both are set)
Example (macOS or Linux):
export GEMINI_API_KEY="your_api_key_here"
Security note: do not commit API keys to source control. Use a secrets manager in CI and production environments.
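A minimal sketch of a fail-fast key lookup at startup, assuming the precedence described above (GOOGLE_API_KEY wins when both variables are set). The helper name resolve_api_key is hypothetical and not part of any SDK:

```python
import os


def resolve_api_key(env=None):
    """Return the API key from the environment, or raise if none is set.

    Hypothetical helper: checks GOOGLE_API_KEY first, then GEMINI_API_KEY,
    mirroring the precedence described in the text.
    """
    env = os.environ if env is None else env
    for name in ("GOOGLE_API_KEY", "GEMINI_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("Set GEMINI_API_KEY or GOOGLE_API_KEY before starting the app.")
```

Failing at startup, rather than on the first request, makes a missing secret visible in deployment logs immediately.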
Using the API Key with the Python SDK
Google's GenAI SDKs can auto-detect your key from environment variables, so genai.Client() also works with no arguments, which simplifies deployment configuration. The Python example here passes the key explicitly for clarity:
pip install -U google-genai

import os
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Explain how rate limiting works in the Gemini API.",
)
print(response.text)
Using the API Key with the Node.js SDK
Node.js example using @google/genai:
npm install @google/genai
import { GoogleGenAI } from "@google/genai";
// @google/genai exports GoogleGenAI; GoogleGenerativeAI belongs to the older @google/generative-ai package.
const client = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
});
From here, you call client.models.generateContent in the same way as in Python, passing a model name and your prompt contents.
Option 2: Vertex AI Authentication (Google Cloud Credentials)
For enterprise environments, Vertex AI is generally preferred because it integrates with IAM, service accounts, logging, and quota governance. A standard setup uses Application Default Credentials (ADC):
- Create a Google Cloud project and enable the relevant Gemini or Vertex AI APIs.
- Create a service account with a role such as Vertex AI User.
- Download the JSON key file for local testing, or use workload identity in production.
- Set GOOGLE_APPLICATION_CREDENTIALS to the JSON key path.
- Set GOOGLE_CLOUD_PROJECT (or the equivalent project ID variable) for your tooling.
This approach reduces reliance on developer-managed API keys and supports centralized access controls.
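The setup steps above can be sketched in Python as follows. This is an illustration under assumptions: the helper name vertex_client_kwargs is hypothetical, and reading GOOGLE_CLOUD_LOCATION with a "us-central1" fallback is a choice made for this sketch, not something the steps above prescribe. Credentials themselves are picked up through ADC, so no key appears in code:

```python
import os


def vertex_client_kwargs(env=None):
    """Build keyword arguments for a Vertex AI-backed client from the environment.

    Hypothetical helper. GOOGLE_APPLICATION_CREDENTIALS is not read here
    because Application Default Credentials resolve it automatically.
    """
    env = os.environ if env is None else env
    project = env.get("GOOGLE_CLOUD_PROJECT")
    if not project:
        raise RuntimeError("GOOGLE_CLOUD_PROJECT is not set.")
    # Assumption for this sketch: region comes from GOOGLE_CLOUD_LOCATION,
    # falling back to us-central1.
    location = env.get("GOOGLE_CLOUD_LOCATION", "us-central1")
    return {"vertexai": True, "project": project, "location": location}


# Usage (requires `pip install google-genai` and working ADC):
# from google import genai
# client = genai.Client(**vertex_client_kwargs())
```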
Gemini 3.5 Flash Rate Limits and Quotas
Rate limits for Gemini 3.5 Flash are quota-based rather than a single fixed threshold applied to all users. Limits vary based on several factors:
- Billing status (free tier vs. paid)
- Channel (AI Studio Gemini API vs. Vertex AI)
- Model tier (Flash vs. Pro vs. Ultra)
- Region and project configuration
In practice, you will encounter quotas expressed as requests per minute (RPM) and tokens per minute (TPM). Flash models are designed for higher throughput than Pro or Ultra, which is why they are commonly used for chat and high-volume workloads.
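Because RPM and TPM apply at the same time, the tighter of the two sets your real ceiling. A small worked sketch, using made-up quota numbers purely for illustration (check your own dashboard for actual limits):

```python
def max_requests_per_minute(rpm_limit, tpm_limit, avg_tokens_per_request):
    """Return the sustainable request rate under both an RPM and a TPM quota."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # How many average-sized requests fit inside the token budget.
    token_bound = tpm_limit // avg_tokens_per_request
    return min(rpm_limit, token_bound)


# Hypothetical quotas: 1,000 RPM and 1,000,000 TPM.
# At 2,500 tokens per request the token budget allows only 400 requests,
# so TPM, not RPM, is the binding limit.
print(max_requests_per_minute(1000, 1_000_000, 2500))  # 400
```

Long prompts or large outputs can therefore exhaust quota well before the request count does.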
What Happens When You Exceed Rate Limits?
When you exceed rate limits, the API returns an HTTP 429 Too Many Requests error with a message indicating quota exhaustion. Your client should handle 429 responses by implementing:
- Exponential backoff with jitter
- Retry budgets to prevent retry storms
- Queueing for bursty traffic, especially in multi-tenant backends
- Request shaping such as batching where appropriate
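The first two items can be sketched together: exponential backoff with full jitter, capped by a fixed retry budget. RateLimitError is a hypothetical stand-in for whatever your HTTP client or SDK raises on a 429, and the default delays are illustrative rather than recommended values:

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the exception your client raises on HTTP 429."""


def call_with_backoff(send, max_retries=5, base_delay=1.0, max_delay=32.0,
                      sleep=time.sleep, rng=random.random):
    """Call send(), retrying on RateLimitError with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retry budget spent: surface the error to the caller
            # The delay ceiling doubles each attempt, up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random time in [0, cap) so clients do not retry in lockstep.
            sleep(rng() * cap)
```

The sleep and rng parameters exist only so the function can be exercised in tests without real waiting.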
How to Check and Manage Your Quotas
- AI Studio: review the usage dashboard to see current consumption and remaining quota.
- Google Cloud Console (Vertex AI): review quotas for Vertex AI and Gemini-related services, and submit quota increase requests for production workloads.
If you are building a multi-tenant API, plan your quota strategy early. A common architecture implements per-tenant throttles in your backend so that one tenant cannot exhaust the entire project quota.
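One way to implement per-tenant throttles is a token bucket per tenant. The sketch below is an in-memory, single-process illustration; the class name TenantThrottle and its parameters are hypothetical, and a real multi-instance backend would likely keep this state in a shared store:

```python
import time


class TenantThrottle:
    """Token bucket per tenant: bursts up to `capacity`, refills at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.clock = clock
        self._buckets = {}  # tenant_id -> (tokens, time of last update)

    def allow(self, tenant_id):
        """Return True and consume one token if the tenant may send a request now."""
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (self.capacity, now))
        # Refill for the elapsed time, never beyond capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_second)
        if tokens >= 1.0:
            self._buckets[tenant_id] = (tokens - 1.0, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False
```

Calling allow(tenant_id) before each upstream request lets the backend reject or queue a noisy tenant locally instead of spending shared project quota on it.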
Example REST Requests for Gemini 3.5 Flash
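The core REST call for text generation is a POST to the generateContent method. In this example the API key is passed as a query parameter; the API also accepts it in an x-goog-api-key header, which keeps the key out of URLs and access logs.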
Text Generation with curl
curl \
  -X POST \
  -H "Content-Type: application/json" \
  "https://generativelanguage.googleapis.com/v1/models/gemini-3.5-flash:generateContent?key=${GEMINI_API_KEY}" \
  -d '{
    "contents": [
      {
        "parts": [
          { "text": "Summarize the key differences between Gemini Flash and Pro models." }
        ]
      }
    ],
    "generationConfig": {
      "temperature": 0.7,
      "maxOutputTokens": 512
    }
  }'
The response includes a candidates array containing generated content parts. Most applications read the first candidate and extract the text field for display.
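A minimal sketch of that extraction on the parsed JSON body. The sample payload is invented for illustration and trimmed to the fields used here; real responses carry additional metadata:

```python
def first_candidate_text(response):
    """Join the text parts of the first candidate; return "" if there is none."""
    candidates = response.get("candidates") or []
    if not candidates:
        return ""
    parts = candidates[0].get("content", {}).get("parts", [])
    # Non-text parts carry no "text" key and are skipped.
    return "".join(part.get("text", "") for part in parts)


# Invented sample payload, shaped like a generateContent response.
sample = {
    "candidates": [
        {
            "content": {
                "role": "model",
                "parts": [{"text": "Flash favors speed. "}, {"text": "Pro favors depth."}],
            }
        }
    ]
}
print(first_candidate_text(sample))
```

Guarding against an empty candidates array matters in practice, since a blocked or filtered request may return no candidates at all.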
Streaming Responses for Chat UIs
Streaming suits interactive applications where you want to render output as it is generated. In Python, the SDK wraps the streaming endpoint in a convenient iterator:
from google import genai
client = genai.Client()
stream = client.models.generate_content_stream(
model="gemini-3.5-flash",
contents="Write a short tutorial outline for Gemini 3.5 Flash."
)
for chunk in stream:
    if chunk.text:
        print(chunk.text, end="", flush=True)
For production UIs, consider buffering and emitting partial tokens at sensible intervals to avoid UI thrashing.
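One way to do that buffering is a small generator that merges incoming text and emits at most once per interval. This is a sketch: the name coalesce and the 50 ms default are arbitrary choices for illustration, not SDK features:

```python
import time


def coalesce(chunks, min_interval=0.05, clock=time.monotonic):
    """Yield merged text at most once per min_interval seconds, then flush the remainder."""
    buffer = []
    last_emit = clock()
    for text in chunks:
        if not text:
            continue  # skip empty chunks
        buffer.append(text)
        now = clock()
        if now - last_emit >= min_interval:
            yield "".join(buffer)
            buffer.clear()
            last_emit = now
    if buffer:
        yield "".join(buffer)  # final flush so no text is lost


# Usage with the streaming loop shown earlier (assumes `stream` from the SDK):
# for piece in coalesce(chunk.text for chunk in stream):
#     print(piece, end="", flush=True)
```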
Multi-Turn Chat Example
For conversational workflows, use the SDK chat helper to maintain conversation state on the client side and send history with each request:
from google import genai
client = genai.Client()
chat = client.chats.create(model="gemini-3.5-flash")
while True:
    user_input = input("You: ")
    if user_input.lower() in {"exit", "quit"}:
        break
    response = chat.send_message(user_input)
    print("Model:", response.text)
This pattern serves as a baseline for assistants, internal help desks, and customer support bots where Flash's low latency directly affects user experience.
Tools, Grounding, and Multimodal Inputs
Beyond plain text, the Gemini API supports capabilities such as Google Search grounding, code execution, URL context, file search, and computer use. A simple grounding example with Google Search:
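from google.genai import types

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="What is the latest guidance on EU AI regulation?",
    # Tools are passed through the request config, not as a top-level argument.
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())]
    ),
)
print(response.text)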
When designing production workflows with tools, treat them as external dependencies: log tool calls, track latency, and apply allowlists to reduce risk.
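For tools that your own backend executes (for example, function-calling handlers), the logging and allowlist advice can be sketched as a thin wrapper. Everything named here (run_tool, ALLOWED_TOOLS, the logger name) is hypothetical and shown only to illustrate the pattern:

```python
import logging
import time

logger = logging.getLogger("tool_calls")

# Hypothetical policy: only these tool names may run.
ALLOWED_TOOLS = frozenset({"google_search", "url_context"})


def run_tool(name, handler, *args, allowed=ALLOWED_TOOLS, **kwargs):
    """Run handler only if name is allowlisted, logging the outcome and latency."""
    if name not in allowed:
        logger.warning("blocked tool call: %s", name)
        raise PermissionError(f"Tool {name!r} is not on the allowlist")
    start = time.perf_counter()
    try:
        return handler(*args, **kwargs)
    finally:
        # Logged on success and on failure, so slow or failing tools stay visible.
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("tool=%s latency_ms=%.1f", name, elapsed_ms)
```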
Practical Deployment Checklist
- Choose the right authentication path: AI Studio key for prototypes, Vertex AI for enterprise controls.
- Centralize secrets: use a secrets manager, rotate keys regularly, and limit access scope.
- Handle rate limits properly: implement retries with exponential backoff for 429 errors, plus queueing and per-tenant throttles.
- Monitor usage: track RPM, TPM, latency, and error rates in production.
- Choose Flash vs. Pro deliberately: Flash for throughput and responsiveness, Pro for tasks requiring deeper reasoning.
Conclusion
This Gemini 3.5 Flash API tutorial covered the core elements needed to build reliable integrations: authentication via AI Studio API keys or Vertex AI credentials, quota-driven rate limits with 429 handling strategies, and concrete REST and SDK examples for text generation, streaming, and multi-turn chat. Because Gemini 3.5 Flash shares the same endpoint patterns as other Gemini Flash models, you can begin with the documented request formats immediately and scale using Vertex AI quotas and IAM as your workload grows.
For teams looking to formalize AI development skills and secure implementation practices, Blockchain Council offers structured training programs in AI development, prompt engineering, and cybersecurity that support production-ready standards across engineering organizations.