Table of Contents
- Preface
- Part 1: For All Solo Knowledge Workers
- Who This Book Is For
- The Rise of the AI-Powered Solo Knowledge Worker
- AI Tools: Powering the Solo Revolution
- AI Tool Categories for Solo Knowledge Workers (May 2025)
- Beyond Automation: AI as Creative Catalyst, Market Navigator, and Personalization Engine
- Independent Consultants
- Programmers and Developers: Productivity Surge, Evolving Skillset
- Wrap Up for the AI-Powered Solo Work Revolution
- Key Concepts: LLMs, Generative AI, Multimodality
- Google’s AI Ecosystem
- Gemini Integration with Gmail, Google Docs, Google Sheets, Google Meet, Google Drive, and Google Calendar
- Part 2 - Using Python Scripts
- Python Introduction and Tutorial
- First Steps Using Python with the Gemini Python Library
- Gemini Thinking Mode
- Using Gemini For Both Images and Text (Multimodal Capabilities)
- Using Gemini to Write Python for Google Colab
- Book Wrap Up
Preface
Most books I have written are deeply technical; this is not one of them. Here, I attempt to make state-of-the-art AI approachable and usable for a wider, not necessarily technical, audience. For those readers most interested in practical examples, you can skip to the chapter “Google’s AI Ecosystem.”
Since the 1980s, I’ve been deeply involved in the development of commercial Artificial Intelligence (AI) tools and applications. My journey has led me to deliver AI systems and conduct research for organizations like Capital One, Google, SAIC, DARPA, Olive AI, PacBell, and Babylist. I have 55 US patents and I have written over 20 books on AI.
For my work and personal research, I use Large Language Model (LLM)-based AI tools from various vendors, including OpenAI, Google, and Groq Cloud. I also run open models from Meta, DeepSeek, Mistral, and Alibaba on my personal Mac system using Ollama.
This book is fairly narrow in scope: it is written for solo knowledge workers who use the Google AI Platform. Other related topics are covered in several of my books that can be read online at https://leanpub.com/u/markwatson.
For the purposes of this book, solo knowledge workers include freelance creatives, small business owners, solo software developers, independent marketers, general consultants, and independent researchers.
Who is a Copilot for Whom?
The best advice for young people starting their education or careers (as well as older workers who are updating their skillsets) is:
- Learn to use AI as an effective copilot for your work and research.
- Learn to be an effective copilot for AIs that are helping you.
These will be recurring themes in this book. Please note that I am using the expression copilot as a general term, and I am not referring to Microsoft’s and GitHub’s Copilot commercial products.
Why is this Book Important to You? What is a Knowledge Worker?
There are many fine AI tools that you can use for your work, and the Google AI platform is a good choice. For my own work I like to pick a particular platform and “stick with it” until either a project is done or other compelling technology and tools surface.
Many developers and knowledge workers have begun using Large Language Models (LLMs) without fully understanding their functionality, limitations, and optimal use for practical applications and personal knowledge worker assistance.
My dual goal is to intertwine an understanding of core technologies with practical techniques for using AI effectively in your job.
We can also define the “Solo Knowledge Worker” by building on Peter Drucker’s foundational concept of the knowledge worker: individuals whose main capital is knowledge and who rely on their intellectual skills. The modern knowledge worker utilizes cognitive, relational, and communicative faculties, often in collaboration with technology, to perform their work. Their focus is on generating value through thought and expertise, rather than merely completing tasks efficiently.
Technology exists to serve human interests. My hope is that this book empowers you to be more productive in your work, freeing up time for the people and activities that bring you joy and fulfillment.
I hope that you both enjoy this book and find it useful!

Structure of This Book
We start with material that tech-savvy, non-programming readers will have no problem working through: a general introduction to available tools, followed by the general “non-programming” skills that you will want to master.
The last part of the book is aimed at readers who have some knowledge of the Python programming language. The programming examples are designed to be both simple to understand and useful. For more technical and software-intensive material please see my other books at https://leanpub.com/u/markwatson.
Things Said and Things Left Unsaid
I leave out most material specific to enterprise use of the Google AI Platform and concentrate on tools more likely used by individuals. As a result:
- We will concentrate on the Gemini app, the Google AI Studio app, and directly calling the Gemini model APIs (which requires a simple-to-obtain API key).
- We will use simple examples to cover the tight Gemini integration with Gmail, Google Docs, Google Sheets, Google Meet, Google Drive, and Google Calendar.
- There is relatively little coverage here of Vertex AI, Google Cloud’s comprehensive, unified platform for building, deploying, and managing machine learning models and AI applications. If you are a professional software developer you can augment the material here with https://cloud.google.com/vertex-ai/docs.
- There is little coverage of the Google Workspace product that is used by companies, schools, and government agencies. Individuals tend to use the Gemini app that interacts directly with data in the free Google apps like Gmail, Google Calendar, Google Docs, etc. I created a fresh Google Workspace account for the purpose of writing this book, but most of the material pertains to the “free” Gmail and other Google apps (the setup I use for my personal workflow).
Acknowledgements
I would like to thank my wife, Carol Watson, for her unwavering support and for serving as the editor of this book.
I also want to express my gratitude to Leanpub for providing an excellent platform that makes publishing accessible and straightforward. I highly recommend Leanpub to other authors who wish to share their knowledge with the world.
I would like to thank the following readers who have reported errors or offered suggestions: none so far.
Part 1: For All Solo Knowledge Workers
This first section is useful for all solo knowledge workers. The second section will cover material for solo knowledge workers who are also Python developers.
- Who This Book Is For
- The Rise of the AI-Powered Solo Knowledge Worker
- Why Google’s AI Ecosystem?
Who This Book Is For
Solo Knowledge Workers are individuals who primarily use their thinking and intellectual abilities to work independently, often with a focus on autonomy and flexibility. This could range from managing their own business operations to handling projects on a contract basis as consultants. We will look at types of solo knowledge workers and how AI can help them in their work, their research, and often just brainstorming.
The modern economy is witnessing a significant rise in independent professionals. These individuals, operating outside traditional employment structures, leverage their intellectual skills and expertise to offer specialized services across diverse industries. While this path offers appealing benefits like autonomy and flexibility, it also presents unique challenges. Solo knowledge workers often face resource constraints, potential isolation, and the burden of managing all aspects of their business single-handedly, from marketing and administration to client delivery.
Freelance Creatives
Writers, designers, and marketers often face challenges such as inconsistent work, unclear project details, and the need to constantly find new clients. Google AI tools like Gemini, especially when integrated with Google Workspace, can be really helpful for them. For example, Gemini can assist with brainstorming content ideas, drafting initial versions of articles or marketing copy, and even generating images for their projects. Tools like NotebookLM can also speed up their research process by quickly summarizing documents and identifying key information.
Software Developers
I am a software developer and using LLMs has totally changed my workflows, including:
- Entering source code and runtime errors into a prompt and asking for debugging advice. Systems like Gemini and ChatGPT will often ask for additional debug print statements to be added; once this additional debug output is in the chat context, they can offer advice on fixing errors.
- Entering source code and asking for additional error handling to be added.
- Entering source code and asking for the code to be made more concise, or alternatively asking for the code to be annotated with detailed comments.
- Writing new code.
As I write this chapter in May 2025, I estimate that on average using LLM-based tools reduces my development time on projects by about 50%.
AI for Marketing Tasks
Gemini can be used to analyze marketing data, identify trends, and even help draft compelling ad copy or social media posts. The ability to upload and analyze spreadsheets or other data files directly within Gemini or Google Sheets can save marketers a lot of time and effort in understanding campaign performance and making data-driven decisions. Even image generation tools within the Google ecosystem can be leveraged for creating marketing visuals, though it’s important to be aware of potential restrictions on branded content.
Consultants and Researchers
AI offers significant advantages for individual consultants and researchers, enhancing their productivity and the quality of their work. For consultants, AI tools can automate time-consuming tasks such as market research and data analysis, allowing them to focus on client interaction and strategic advice. Large Language Models (LLMs) like Gemini can assist in drafting proposals, creating reports, and even brainstorming innovative solutions tailored to specific client needs. AI-powered tools can also help consultants stay abreast of the latest industry trends and regulations by summarizing vast amounts of information quickly. Furthermore, AI can personalize client communication, identify potential new business opportunities through data pattern recognition, and optimize pricing strategies based on market dynamics, ultimately leading to more efficient operations and increased client satisfaction for solo consultants.
Researchers can leverage AI to accelerate the pace of discovery and improve the rigor of their investigations. LLMs can assist in literature reviews by efficiently processing and summarizing numerous academic papers, identifying key themes, and highlighting potential gaps in current knowledge. AI tools can also aid in data analysis, helping to identify complex patterns and correlations in large datasets that might be missed through traditional methods. In fields like scientific research, AI can even assist in generating hypotheses and designing experiments. Moreover, AI-powered writing and editing tools can help researchers refine their manuscripts, ensuring clarity and accuracy before publication. By automating routine tasks and providing powerful analytical capabilities, AI empowers individual researchers to dedicate more time to critical thinking, creative problem-solving, and pushing the boundaries of their respective fields.
For researchers, Google’s NotebookLM offers a transformative approach to managing and synthesizing information. Imagine uploading a collection of research papers, interview transcripts, and field notes, and then being able to ask NotebookLM nuanced questions that draw connections across all those documents. Instead of manually sifting through disparate sources, researchers can use NotebookLM to quickly identify key themes, compare and contrast different findings, and extract specific evidence to support their arguments. This capability streamlines the literature review process, facilitates deeper understanding of complex topics, and ultimately frees up researchers to focus on higher-level analysis and the development of novel insights.
The Rise of the AI-Powered Solo Knowledge Worker
Dear reader, as I write this in May 2025, the world economy is going through a lot of changes. Large corporations, taking advantage of the “AI productivity boost,” have either cut back on hiring or, in many cases, reduced headcount. For many workers, being an independent consultant or starting a business is now an economic necessity; for others, control of one’s economic future is a benefit.
For whatever reason, more people are working outside the realm of traditional jobs with corporations or in government.
AI Tools: Powering the Solo Revolution
The emergence of the AI-powered solo knowledge worker is fundamentally enabled by an expanding ecosystem of sophisticated yet accessible Artificial Intelligence tools. As of May 2025, this landscape has moved beyond novelty, offering individuals capabilities that, until recently, were the exclusive domain of large organizations with dedicated teams and substantial budgets. Understanding the specific categories of these tools and the deeper benefits they confer beyond mere task automation is crucial to grasping the magnitude of this shift.
AI Tool Categories for Solo Knowledge Workers (May 2025)
The following table shows a subset of available AI tools, their key functionality, and potential benefits for solo knowledge workers:
| Category | Example Tools | Key AI-Enabled Functionality | Benefit for Solo Worker |
|---|---|---|---|
| Marketing & Sales | HubSpot CRM, Semrush, Adzooma, involve.me, Wisepops, Wishpond, Clay, Instantly.ai | CRM integration, SEO analysis/optimization, ad campaign management, interactive lead funnels, behavior-based pop-ups, AI website/copy generation, competitive intelligence, automated outreach | Design, execute, and analyze complex marketing campaigns; generate leads; optimize ad spend; gain market insights; acquire clients efficiently. |
| Content Creation & Design | GPT/Claude/Gemini, Jasper/Copy.ai, Midjourney/DALL-E 3, Synthesia/Runway/Zebracat, Canva/Visme/Adobe Firefly | Text generation, grammar/style check, image/video/voice/music generation, graphic design templates/suggestions, background removal, logo design, font pairing | Produce professional-grade text, visuals, audio, and video content quickly; overcome creative blocks; create marketing assets; maintain brand consistency across multiple formats. |
| Operations & Productivity | Asana/ClickUp, Notion AI, Reclaim/Clockwise, Fathom/Fireflies, Guru, Zapier, Motion | Project/task management automation, AI scheduling, meeting transcription/summarization, contextual knowledge delivery, workflow automation across apps, smart task prioritization | Reduce administrative burden; streamline workflows; improve time management; enhance collaboration (if applicable); centralize knowledge; focus cognitive resources on core tasks. |
| Analysis & Decision Support | Tableau, Power BI, Alteryx, Rows, quantilope, Google AutoML, KNIME | Data visualization, natural language querying, predictive analytics, automated reporting, automated market research methods/analysis, custom ML model building (no-code), data workflow design | Gain deeper insights from data; make data-driven decisions; forecast trends; automate complex analysis and research; create sophisticated reports and visualizations without deep technical expertise. |
| Client Management & Comm. | Intercom, Pylon, Rasa, Tidio AI, Guru, SaneBox | AI chatbots, automated support responses, AI-generated support articles, contextual knowledge for support, intelligent email management | Provide scalable and responsive 24/7 client support; personalize interactions; manage communication efficiently; build stronger client relationships. |
| Specialized & Technical | Cursor/Copilot, Bubble/v0, Duda/Wix/Shopify, Harvey/Abridge, DataLab AI, Talend | AI code generation/debugging, no-code app building, AI website design/content, legal/medical AI assistants, AI-assisted data science coding, ML algorithm integration | Accelerate development cycles; build applications/websites faster; access domain-specific AI knowledge; streamline technical workflows in specialized fields like law, medicine, or data science. |
Beyond Automation: AI as Creative Catalyst, Market Navigator, and Personalization Engine
While automation of routine tasks is a significant benefit, the true transformative potential of AI for solo knowledge workers lies in its ability to augment higher-order functions, enabling them to perform better or engage in activities previously out of reach:
- Enhanced Creativity: AI is becoming a creative collaborator, not just a tool for execution. Design platforms such as Canva leverage AI to suggest layout improvements or stylistic variations. AI assistants like Gemini can facilitate brainstorming sessions. By handling some of the more mechanical aspects of creation and offering unexpected inputs, AI acts as a tireless collaborator and skill amplifier. This allows solo creators to explore more ambitious projects and experiment with new styles or mediums that might have previously required extensive training or collaboration with other specialists.
- Democratized Market Access & Intelligence: Historically, deep market understanding and broad outreach required significant resources. AI is leveling this playing field, allowing solo knowledge workers to compete vigorously in niche markets.
- Hyper-Personalized Service Delivery: AI enables solo practitioners to deliver highly personalized experiences to their clients, fostering stronger relationships and differentiation.
Independent Consultants
For independent consultants, AI presents a powerful opportunity to enhance efficiency and compete effectively with larger firms. By leveraging AI for data analysis and process automation, they can deliver insights and solutions more rapidly. Their value proposition increasingly lies not in manual data processing or standard framework application, but in strategic interpretation, customized advice tailored to specific client contexts, and the ability to orchestrate AI tools effectively. Offering specialized services related to AI strategy or implementation also represents a growing market niche.
Programmers and Developers: Productivity Surge, Evolving Skillset
Dear reader, this topic hits home for me because I have spent my adult life working as a programmer/computer scientist, and in the last two years AI tools have revolutionized my workflow as well as my personal research.
Software development is being profoundly impacted by AI, primarily through coding assistants like GitHub Copilot, Aider, Cursor, and others. These tools significantly accelerate the process of writing code, with substantial portions of new code now being AI-generated. AI also assists in debugging, code optimization, and even aspects of testing. The rise of low-code and no-code platforms, often incorporating AI features, further abstracts traditional coding tasks, potentially reducing demand for some routine programming roles.
Earlier we discussed the two-way relationship between the human knowledge worker and AI: sometimes the AI is the copilot, asked to write code, documentation, unit tests, etc. Other times the human worker is the copilot, asked to add debug printouts or provide further information or guidance.
For solo freelance developers, AI coding tools offer a significant productivity advantage, allowing them to take on larger projects or deliver faster. However, their value proposition must evolve beyond just writing code. Success will increasingly depend on cultivating expertise in software architecture, security best practices, quality assurance, AI integration, and the ability to effectively manage and validate AI-generated code. Offering specialized services in areas like AI model development, AI system integration, or AI code auditing presents a valuable niche. Clear communication with clients regarding the use of AI tools and the processes for ensuring code quality and security will also be crucial.
Wrap Up for the AI-Powered Solo Work Revolution
The World Economic Forum predicted that a majority of employees would require significant reskilling in 2025 and beyond due to AI’s impact. Solo workers, lacking the structured training programs of large organizations, must take personal responsibility for staying abreast of developments. This involves not only learning how to use new tools but also understanding their underlying capabilities, limitations, and appropriate use cases.
Strategies for solo workers include actively seeking out learning resources, such as online courses from platforms like Coursera or Udacity, which may offer AI-powered personalized learning paths. Hopefully this book will provide you, dear reader, with new ideas and strategies for navigating a turbulent but promising future.
Key Concepts: LLMs, Generative AI, Multimodality
Large Language Models (LLMs) have become foundational to modern AI applications. Trained on vast corpora of text, these models leverage transformer architectures to perform a wide range of natural language processing tasks, including text generation, summarization, translation, and question answering. Their ability to understand and generate human-like text has enabled applications such as chatbots, virtual assistants, and content creation tools. The scalability and adaptability of LLMs make them versatile tools in various domains, from customer service to education.
Generative AI encompasses a broader category of models capable of creating new content across different modalities, including text, images, audio, and video. While LLMs are a subset focused on text, generative AI also includes models like diffusion models for image generation and audio synthesis models. These models learn patterns from training data to produce novel outputs, enabling applications such as image creation, music composition, and video generation. The integration of generative capabilities across modalities expands the creative potential of AI systems.
Multimodal AI refers to systems that can process and integrate information from multiple data types, such as text, images, audio, and video. Multimodal LLMs (MLLMs) extend traditional LLMs by incorporating additional modalities, allowing for more comprehensive understanding and generation of content. For instance, models like Gemini can accept both text and image inputs, enabling tasks like image captioning, visual question answering, and multimodal dialogue. This capability enhances the contextual understanding and versatility of AI applications.
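As a concrete illustration of multimodal input, here is a minimal Python sketch using the google-generativeai library covered in Part 2 of this book; the model name and the image file name are examples, not requirements:

```python
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # replace with your own key
model = genai.GenerativeModel("gemini-1.5-flash")

# Multimodal input: a list mixing a text instruction and a PIL image.
image = PIL.Image.open("kitchen_photo.jpg")  # hypothetical local image file
response = model.generate_content(["Describe what is in this photo.", image])
print(response.text)
```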
The development of multimodal generative AI has led to models capable of producing outputs across various modalities. These models are designed to handle inputs and outputs in any combination of text, images, audio, and video, facilitating complex cross-modal interactions. Such models are instrumental in creating more natural and human-like AI systems, capable of understanding and generating rich, multimodal content. This advancement opens up new possibilities in fields like virtual reality, human-computer interaction, and creative industries.
The convergence of LLMs, generative AI, and multimodality is reshaping the landscape of AI applications. By integrating capabilities across different data types, AI systems can achieve a more holistic understanding and generation of content, leading to more intuitive and effective interactions. This integration is pivotal in developing AI agents that can seamlessly operate across various tasks and environments, enhancing their utility in real-world applications.
As these technologies continue to evolve, they present both opportunities and challenges. The potential for more sophisticated and versatile AI applications is vast, but it also necessitates careful consideration of ethical, technical, and societal implications. Ensuring responsible development and deployment of these systems is crucial to harnessing their benefits while mitigating risks.
Google’s AI Ecosystem
This chapter serves as an introduction to tools we will use repeatedly in many examples.
The Google AI ecosystem offers a diverse range of tools, from user-friendly applications integrated into daily workflows to powerful cloud-based platforms and APIs. For the solo knowledge worker, navigating this landscape requires focusing on tools that are realistically accessible, useful for individual or very small team contexts, and considerate of potential budget and technical expertise limitations.
Differences Between AI Support in the Paid Google Workspace and Free Apps Like Gmail, Calendar, and Docs
The integration of AI capabilities, primarily through Gemini, differs somewhat between the paid Google Workspace platform and the suite of free Google applications (Gmail, Calendar, Docs, Drive, etc.) accessible with a standard Google account. The core distinction lies in the depth and nature of this integration, impacting user workflow and the scope of AI assistance.
Gemini’s integration is deeper in the commercial Workspace product, where it functions as a copilot across workflows in Workspace apps. Gemini operates within the application you’re using. When drafting in Docs, it can help write, summarize, or change the tone based on the existing document content. In Gmail, it can draft replies aware of the email thread’s context. In Sheets, it can help generate formulas or analyze data present in the sheet.
For users of the free Google apps, Gemini integration operates differently. It’s not typically embedded directly within the free versions of Docs, Sheets, or standard Gmail in the same co-pilot manner. Instead, the interaction primarily occurs:
- Through the Gemini App/Web Interface: Users interact with Gemini via its dedicated mobile apps (Android/iOS) or the web interface (gemini.google.com).
- Via Extensions/Connections: To allow Gemini to access data from free apps like Gmail, Drive, or Calendar, users must explicitly enable specific “Extensions” or connections within the Gemini settings. This grants Gemini permission to query that data. (See figure below.)
- Querying and Summarization: The focus is more on using Gemini as a central hub to ask questions about or summarize information from your connected services. For example, you can ask Gemini to “summarize my recent emails from John Doe,” “find the document titled ‘Project Phoenix proposal’,” or “what’s on my calendar tomorrow?”.
- Less Integrated Workflow: While Gemini can access this data, the workflow is less seamless. You generally need to switch from your primary application (e.g., Gmail) to the Gemini interface to invoke its capabilities regarding that application’s data. While it might help draft a basic email from the Gemini interface, it doesn’t offer the same level of contextual, in-app writing assistance as the Workspace version.
As seen in the following figure, we need to explicitly allow Gemini access to data in other Google apps:

When I work for corporate customers I use their Workspace instance, but I have never used Workspace for an extended period of time in my own workflow. One feature I find very useful in Workspace is the Cloud Search application that searches across all your data in Google apps and Google Drive: very convenient!
Conversational AI Using the Gemini App and Web App
Gemini serves as a versatile conversational AI assistant, capable of understanding and generating human-like text, engaging in dialogue, summarizing information, translating languages, writing different kinds of creative content, and generating software code.
Configuring Gemini Advanced Web App
Note: I use Gemini Advanced but most examples in this book also work with the free version of Gemini.
After logging into https://gemini.google.com/ use the menu in the upper left-hand corner (it looks like three horizontal lines: the “hamburger” icon) to expand the menu, select Settings (the last menu item at the bottom), and then select the Apps submenu. This is shown in the previous figure. You can now connect Gemini to the Google apps that you want to grant Gemini access to. For work I connect all Google Workspace apps (Gmail, Calendar, Docs, Drive, Keep, and Tasks). For fun and personal use I also connect Google Flights and Hotels, as well as Google Maps, YouTube, and YouTube Music (which I use). Optionally connect to OpenStax to get access to licensed textbooks.
Ask Gemini Chat
I use Gemini Chat fairly often interchangeably with OpenAI’s ChatGPT and Anthropic’s Claude with one life hack: I usually only pay for one service at a time. As I write this in May 2025, I subscribe to Gemini Advanced and use Claude and ChatGPT in the free mode. Assuming that you subscribe to Gemini Advanced here are the current models to choose from:

I usually select Gemini 2.0 Flash for general use because it is the fastest and uses fewer resources, which matters if you care about energy efficiency and the environment. (For good background on the environmental costs of AI, I recommend reading Kate Crawford’s book Atlas of AI.)
You have four easily used options in the chat input as seen in this figure:

The options are:
- + : click the plus sign to add files to your current context window. For example, if I have a PDF file for a textbook, I will import the book’s PDF before asking questions about the content of the book. You can add several context files.
- Deep Research: useful when you want Gemini to perform a thorough web search, choose which search hits are useful, and add the search results to the context before spending reasoning time answering your question or prompt.
- Canvas: useful for creating documents and software code that can later be downloaded to your computer.
- Microphone: click the icon that looks like a microphone to enter prompt text with voice input. If you are in a private workplace, this is the option I recommend starting with; then hand-edit the dictated prompt text.
Using the macOS Gemini App
If you work on a Mac, I recommend that you download the Gemini app. Here is a screenshot:

Functionally this app is equivalent to running the web app in the Chrome or Safari web browsers.
Advice On Writing LLM Prompts
Google’s Gemini represents a family of sophisticated large language models (LLMs) engineered with powerful multimodal capabilities, capable of processing and understanding not just text, but also image, audio, and video inputs. You can drop image files, text, files containing software, and PDFs into the Gemini chat window and this data supplies context for any text chats and prompts you then manually enter.
You engage with advanced AI systems like Gemini through various interfaces, including web applications and mobile apps, primarily by providing ‘prompts’. These prompts, which can range from simple text questions to complex instructions involving uploaded files or even spoken commands, serve as the fundamental mechanism for directing Gemini’s behavior and eliciting specific responses, guiding the AI to perform tasks like generating text, analyzing data, creating code, or answering questions based on the provided input.
The practice of crafting these inputs effectively is known as prompt engineering or prompt writing. It is often described as both an art and a science, requiring a blend of creativity, linguistic precision, and an understanding of how LLMs like Gemini interpret instructions and generate outputs. At its core, prompt writing is the skill of communicating intent clearly to the AI, guiding it to leverage its vast knowledge and capabilities in the desired direction while navigating its inherent limitations. It bridges the gap between human intention and the AI’s operational logic, enabling more controlled and predictable interactions.
Effective prompt writing hinges on several key principles designed to maximize the clarity and relevance of the AI’s response. Central among these are clarity and specificity, ensuring instructions are unambiguous and detailed enough for the model to understand the exact task. Providing sufficient context is also crucial, giving Gemini the necessary background information to frame its response appropriately. Furthermore, defining the desired output format (e.g., bullet points, email, code block, or Markdown) and tone (e.g., formal, conversational, humorous) helps shape the final result to meet specific needs. Techniques like including examples (few-shot prompting) can further refine the output. Examples matter: when asking Gemini to pull structured data from input text into formats like JSON, it is useful to provide an example of the desired JSON output, as shown in the sketch below.
Mastering the art of prompt writing is paramount for unlocking the full potential and utility of powerful AI models like Gemini. Well-crafted prompts lead directly to higher-quality, more accurate, and significantly more useful responses, minimizing generic or irrelevant outputs. This skill translates into tangible benefits such as enhanced productivity, accelerated workflows, improved decision-making, and the ability to leverage Gemini for more complex creative and analytical tasks. Ultimately, skillful prompt writing transforms the interaction with Gemini from simple querying into a powerful collaboration, allowing users to harness its advanced capabilities more effectively and reliably.
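To make few-shot prompting concrete, here is a minimal sketch using the google-generativeai Python library that we cover in Part 2; the review text and the JSON fields are invented for illustration:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # replace with your own key
model = genai.GenerativeModel("gemini-1.5-flash")

# One worked example (few-shot) shows Gemini the exact JSON shape we expect.
prompt = """Extract the product name and sentiment from the review as JSON.

Review: "The Acme blender is fast but noisy."
Output: {"product": "Acme blender", "sentiment": "mixed"}

Review: "I love my new Pixel phone, the camera is amazing."
Output:"""

response = model.generate_content(prompt)
print(response.text)  # expect JSON matching the example's shape
```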
Creating Gemini Gems
Gemini Gems are essentially customized versions of the Gemini AI assistant that you can create and save for specific purposes. You create Gems using the Gem manager menu option. The Gem manager screen shows Premade by Google gems that I suggest you explore for ideas. After you create your own Gems, they appear at the bottom of the Gem manager screen. We will create a new gem after some background:
Think of Gems as specialized “experts” or focused tools built upon the core Gemini model. You define a Gem by providing it with a specific set of instructions, outlining its persona, the task it should perform, the context it needs to consider, and potentially the format for its responses. You can further enhance a Gem’s capabilities by providing it with knowledge files, such as uploaded documents or files from your Google Drive, which it can reference to inform its answers and maintain consistency with your specific information or guidelines.
The primary reasons for creating and using Gems revolve around efficiency, personalization, and consistency. By saving detailed instructions within a Gem, you avoid repeatedly typing complex prompts for recurring tasks, saving significant time and effort. This allows you to tailor Gemini’s behavior precisely to your needs – whether you require a specific writing tone, adherence to particular coding standards, consideration of dietary restrictions for recipes, or knowledge about a niche topic. Using Gems ensures that Gemini consistently follows your predefined guidelines or consults specific documents every time you interact with that particular Gem, leading to more reliable and relevant outputs for specialized tasks.
Examples of Gems are quite varied, reflecting the flexibility of the concept. You could create a “Coding Partner” Gem instructed on your preferred programming languages and project context, potentially referencing specific code libraries. A “Marketing Copywriter” Gem could be built using your company’s brand voice guidelines and target audience details stored in uploaded documents. For personal use, you might design a “Meal Planner” Gem that knows your dietary preferences and budget, a “Workout Generator” Gem aware of your fitness goals and limitations, or a “Brainstorming Assistant” Gem configured to encourage creative, out-of-the-box thinking for your specific projects or hobbies.
Creating a Gem for Suggesting Recipes Given the Context of Some Favorite Recipes
I will run through this example step by step. Click on the + New Gem button. Here I create a new Gem by dropping three JSON recipe files from my CookingSpace.com project into the Knowledge input form at the bottom of the screen, name the gem, and provide instructions.
Writing new Gems is simple, so if you repeatedly run similar prompts, then by all means customize and personalize Gemini by creating your own Gems!
Here are a few screenshots showing creating and use of a new Gem:

OK, we have a new Gem, let’s use it:

Here is a slightly edited (for brevity) output for the prompt:
```
I have salmon, leftover brown rice, some broccoli and spinach. Make up a new Asian style stir fry recipe for me.
```
Here is the generated recipe:
```
Here's a recipe idea for Quick Salmon & Veggie Stir-Fried Rice:

Yields: 1-2 servings
Prep time: 10 minutes
Cook time: 10-15 minutes

Ingredients:

- Salmon: 1 fillet (about 6 oz), skin removed and cut into 1-inch cubes
- Broccoli: 1 cup, cut into small florets (similar to Salmon Rice or Lemon Sesame Broccoli)
- Spinach: 1-2 cups, fresh
- Leftover Cooked Brown Rice: 1.5 - 2 cups (like in Salmon Rice or Simple Shrimp Fried Rice)
- Garlic: 1-2 cloves, minced (inspired by multiple recipes like Spicy Shrimp with Vegetables and Rice)
- Ginger: 1 teaspoon, freshly grated (optional, inspired by Simple Shrimp Fried Rice)
- Soy Sauce: 2 tablespoons (a common ingredient in Asian-style recipes provided)
- Sesame Oil: 1 tablespoon (used in Salmon Rice, Honey Chinese Shrimp, Simple Shrimp Fried Rice)
- Honey or Brown Sugar: 1 teaspoon (optional, for a touch of sweetness like in Honey Chinese Shrimp)
- Oil for cooking: 1-2 tablespoons (e.g., vegetable, canola, or olive oil, as seen in various recipes)
- Optional: Red pepper flakes (like in Spicy Shrimp with Vegetables and Rice or Marinara Sauce), sliced green onions for garnish.

Directions:

Prepare the Sauce: In a small bowl, whisk together the soy sauce, sesame oil, and optional honey/brown sugar and red pepper flakes. Set aside.

Sauté Aromatics & Broccoli: Heat 1 tablespoon of cooking oil in a large skillet or wok over medium-high heat. Add the minced garlic and grated ginger (if using) and stir-fry for about 30 seconds until fragrant. Add the broccoli florets and stir-fry for 3-4 minutes until crisp-tender.

Cook Salmon: Push the broccoli to one side of the skillet. Add the cubed salmon to the empty side. Cook for 2-3 minutes per side, until opaque and cooked through. Try not to break the pieces up too much.

Add Rice & Spinach: Add the leftover brown rice to the skillet, breaking up any clumps. Stir-fry for 2-3 minutes until heated through. Add the spinach and stir until it begins to wilt (about 1-2 minutes).

Combine & Sauce: Gently toss the salmon, broccoli, rice, and spinach together. Pour the prepared sauce over everything and toss gently to coat evenly. Cook for another minute.

Serve: Serve immediately, garnished with sliced green onions if desired.

Enjoy your quick and healthy Asian-style stir fry!
```
It took me one minute to create this Gem and it is functional for generating new recipes.
Once Gems are written, try using them and then keep improving them! After I wrote this section I went back to edit my Recipe Gem, adding all eight JSON recipe files from my old recipes project so it has a broader context to suggest new and novel recipes that I might like.
Ideas For Your Own Gems
What, dear reader, do you do for a living and what are your hobbies? I suggest that you start with a fresh sheet of paper (or an empty note file on your computer) and write down three or four of your interests. For each of your interests, write down what data you have. It is better if the data is something that you produced yourself, so that the Gems you create will generate text, audio, images, and videos in your own personal style.
Let’s look at an example use case:
Most of the books that I write are tech-heavy with many programming examples. I very much enjoy both programming and writing but there is one task that I do not enjoy: when I place a software listing in one of my books I need to add several paragraphs describing the code, which libraries I used in the code, etc. This is tedious so I automated this process two years ago with a custom ChatGPT and more recently I automated it once again using Gemini: I collected many examples in my older books of my program listings followed by the explanatory text I wrote myself. Now I can use my Gemini Gem called “Mark Watson Writing Assistant” and drop in any new program listing I have created and I get several paragraphs of explanatory text that I can edit and tweak, and then insert into my manuscript.
I believe that the more effort you put into personalizing AI tools, the more benefit you get.
Google NotebookLM
When you, dear reader, have many source documents (text, PDFs, spreadsheets, videos, etc.) for a new or existing project and you need to understand this material with a minimum of time and effort, import your source documents into a new NotebookLM notebook. As you browse or carefully read the source material, you can ask questions about how it relates to other material in the source documents, state your current interest and ask for a recommendation of the most critical documents, ask for summaries, and so on.
In a loose sense, NotebookLM notebooks function in a similar way to custom Gemini Gems, with a few differences:
- I use NotebookLM notebooks when I want to study 3rd party technology (i.e., something that I did not create myself): they let me collect relevant documents and web sites to study in one place, with a Gemini chat interface for “talking about” this content.
- I use NotebookLM notebooks when I am starting a new writing or software development project to organize my thoughts and to “brainstorm” with Gemini, given the material in a notebook as context.
I tend to remove old notebooks when I am done with them. I find this analogous to how, twenty years ago, I used to organize new writing and software projects with handwritten notes on yellow pads and spread the sheets of paper around. I would clean up the paper artifacts when a project was done, and I likewise like to keep my NotebookLM work environment tidy. Here is a screenshot of the notebooks that I am currently using:

Using the NotebookLM web app is straightforward and I know you will find it useful.
Google AI Studio
Before starting to read this section, dear reader, please open the Google AI Studio web app and login with your Google (or Gmail) account. Google AI Studio is tailored to workflows for developers:
The fundamental distinction between Google AI Studio (aistudio.google.com) and the Gemini web application (gemini.google.com) lies in their target audience and primary purpose. AI Studio is a web-based developer tool focused on prototyping and experimenting with Google’s generative models via the Gemini APIs, whereas the Gemini web app is a direct-to-consumer application designed for general users to interact conversationally with the Gemini models for assistance, content generation, and information retrieval. While both interfaces ultimately leverage Google’s powerful AI models like Gemini, AI Studio provides a structured environment specifically for developers and builders to craft, test, and refine prompts, adjust model parameters (like temperature, top-k, and top-p), compare outputs from different prompt variations, and ultimately generate API keys and corresponding code snippets (in languages like Python, JavaScript, curl, etc.) to integrate the model’s capabilities into their own applications or services. In contrast, the Gemini web app offers a more streamlined, conversational chat interface optimized for direct interaction, task completion, and creative exploration by end users, without exposing the underlying API controls, parameter tuning, or explicit code generation features found in AI Studio. The Gemini web app’s focus is on the experience of using AI, while AI Studio’s focus is on the process of building with AI via its API.
Here is a screenshot of the Google AI Studio web app:

This screenshot shows what you will see the first time you open the web app, with one exception: in the lower left corner of the app window, under History, you see a few of my existing projects like “Hy playground”, “Clojure Research agent”, “Prolog and LLMs”, etc.
Google AI Studio serves as an accessible gateway to Google’s powerful generative artificial intelligence models, most notably the Gemini family. It’s a web-based platform designed for rapid experimentation and prototyping with AI. Whether you’re looking to understand what modern AI can do, build a proof-of-concept for a new application, or simply explore creative possibilities, AI Studio provides an interactive environment to directly engage with sophisticated AI capabilities without complex setup requirements.
For developers, AI Studio is an invaluable tool for quickly iterating on prompts and tuning model parameters like temperature or top-k to achieve desired outputs before integration. You can craft and refine prompts for various tasks, test different model versions, and seamlessly generate API keys to embed the power of Gemini models directly into your own applications and workflows. This significantly accelerates the development cycle for AI-powered features, allowing for faster testing and deployment.
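Once a prompt behaves well in AI Studio, the tuned parameter values can be carried over into code. Here is a minimal sketch of how temperature, top-k, and top-p map onto the Python SDK used later in this book; the model name and parameter values are only examples:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # replace with your own key

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    "Write a one-sentence tagline for a solo consulting business.",
    generation_config=genai.types.GenerationConfig(
        temperature=0.9,       # higher values produce more varied output
        top_k=40,              # sample from the 40 most likely tokens
        top_p=0.95,            # nucleus sampling cutoff
        max_output_tokens=60,  # limit the length of the reply
    ),
)
print(response.text)
```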
For marketing professionals, small business owners, and other non-technical users, AI Studio demystifies generative AI by providing an intuitive interface to explore its potential. You can experiment with generating creative text formats, brainstorming ideas, summarizing information, drafting communications, or even analyzing images, all through simple prompt interactions. This hands-on experience allows users to discover practical applications for AI within their specific business context or creative endeavors, fostering innovation without needing to write a single line of code.
The last section of this book contains Python programs using the Gemini APIs. When writing Python applications using Gemini you can save time prototyping prompts and API parameters using the AI Studio.
Gemini Integration with Gmail, Google Docs, Google Sheets, Google Meet, Google Drive, and Google Calendar
In the last section of this book we will look at short Python programs that utilize the Gemini APIs to access Google Workspace apps like Gmail. Here we show examples of interacting with the Workspace web apps and Gemini.
Combining Gemini in Gmail with Google Docs, Drive, and Calendar
As I write this in May 2025, Gemini in Gmail is tightly integrated with your document files in Google Docs and also supports:
- Google Drive: Gemini can find and summarize information from your files stored in Google Drive directly within Gmail.
- Google Calendar: You can ask Gemini in Gmail to find information about events on your primary calendar or even create new calendar events.
Currently there is some support for:
- Google Sheets & Slides: While the core integration is strong in Docs, Gemini capabilities extend to Sheets (like analyzing data) and Slides (like creating presentations), often accessible via the Gemini side panel within those apps or the main Gemini interface connected to your Workspace account. Information from these might be referenced or summarized within Gmail.
- Google Meet: Gemini can help with meeting-related tasks like taking notes or summarizing, and this information might be accessible or summarized through Gmail.
A few days ago I received a request for free mentoring and while I was reading this person’s email I activated Gemini inside Gmail and prompted “Find my @doc Google Docs that might be useful for answering this person’s questions.”
This was useful because it surfaced two notes in Google Docs that were relevant to the mentoring request email. I didn’t want to use this person’s email as a book example, so I sent myself a similar question, opened the email in Gmail, and activated Gemini. Again, I used the prompt “Find my @doc Google Docs that might be useful for answering this person’s questions.” A screenshot is shown here:

Combining Gemini with Google Calendar and Gmail
Gemini’s integration with Gmail and Google Calendar enhances productivity by allowing you to summarize email threads, draft contextual replies, and extract action items directly within Gmail. Crucially, it bridges communication to scheduling by identifying potential meeting details in emails and helping you create corresponding Google Calendar events, intelligently checking your availability, including across shared calendars (like my wife Carol’s and mine).
The following example uses the same test email, but now the prompt is “Please find Calendar entries for tasks relating to this email”; here are the calendar entries it found:

Using Gemini in a Google Doc Specifying a Design Document for a New Software Project
Integrating Gemini directly within Google Docs, typically via Gemini for Workspace, embeds an AI writing assistant into your workflow; accessible through a side panel or contextual menus (“Help me write”), it allows you to generate drafts, outlines, or summaries, rewrite selected text to adjust tone, length, or formality, brainstorm ideas, and even create images based on prompts, all without leaving the document interface, thus streamlining content creation and refinement.
Gemini can also use the contents of the current document as prompt context. Here I opened an old design document in Google Docs that I wrote many years ago and after activating Gemini, I prompted “Write the software specified in this design document in the Haskell programming language” and Gemini used the text in the document to write an initial Haskell implementation:

Using Gemini with Google Sheets
My favorite use of Gemini with Google Sheets is creating new test data sets for machine learning. Here I use a prompt to create a spreadsheet with three columns:
```
Search the web to make a spreadsheet of dow industrial average and S&P for every day in March 2025
```
Gemini searched the web and created this table.

I used the File -> Download -> Comma Separated Values (CSV) menu to save the file on my laptop.
Wrap Up for Gemini Integration with Google Workspace Apps
I have been experimenting with Gemini integration in Google apps since the summer of 2024, and tool features and capabilities have greatly improved. As I write this in May 2025, I use Gemini integration with Google apps as a daily tool.
Part 2 - Using Python Scripts
Much of the inspiration for the following Python examples comes from Google’s documentation and specifically their Gemini Cookbook.
Python Introduction and Tutorial
Here is a Python tutorial designed for intelligent individuals new to programming, focusing on the core concepts needed to understand the example scripts later in this book.
To set up Python on your computer, follow the installation instructions at https://www.python.org/downloads/.
A Whirlwind Tour of Python for the Curious Mind
Welcome to the world of Python! Think of Python not just as a programming language, but as a precise way to give instructions to a computer. Its popularity stems from its readability – Python code often looks surprisingly like plain English. This makes it an excellent starting point for programming and particularly well-suited for interacting with complex systems like AI models. At its heart, programming is about manipulating data and controlling the flow of execution, and Python provides elegant tools for both.
Let’s start with the absolute basics: variables. A variable is simply a named container for storing information. You might have a variable called prompt to hold the text you want to send to Gemini, or temperature_setting to store a numerical value controlling the AI’s creativity. You create a variable by assigning a value to a name using the equals sign (=), like my_name = "Alex". Here, my_name is the variable, and it now holds the text string "Alex". Python automatically figures out the type of data – text (called a string), whole numbers (integers), decimal numbers (floats), or true/false values (Booleans).
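Here is a short runnable sketch of these ideas; the variable names are only illustrative:

```python
# Python infers each variable's type from the value assigned to it.
prompt = "What is the capital of France?"  # a string
temperature_setting = 0.7                  # a float (decimal number)
max_retries = 3                            # an integer (whole number)
verbose = True                             # a Boolean (True/False)

print(type(prompt), type(temperature_setting), type(max_retries), type(verbose))
```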
Often, you’ll work with collections of data. Two fundamental ways Python organizes data are lists and dictionaries. A list is an ordered sequence of items, enclosed in square brackets []. For example, prompts = ["What is the capital of France?", "Summarize the theory of relativity."] creates a list containing two strings. You can access items by their position (starting from 0). A dictionary, enclosed in curly braces {}, stores key-value pairs. Think of it like a real dictionary: you look up a word (the key) to find its definition (the value). For instance, config = {"model": "gemini-pro", "max_tokens": 500} stores configuration settings. You access values using their keys, like config["model"].
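The following short sketch shows both collection types in action:

```python
# A list is an ordered sequence; items are accessed by position (from 0).
prompts = ["What is the capital of France?",
           "Summarize the theory of relativity."]
print(prompts[0])      # -> What is the capital of France?

# A dictionary stores key-value pairs; values are accessed by key.
config = {"model": "gemini-pro", "max_tokens": 500}
print(config["model"])  # -> gemini-pro
```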
Computers excel at repetition and decision-making. Python uses control flow statements for this. The if statement lets your program make decisions. You provide a condition, and if it’s true, a block of code runs: if user_age >= 18: print("Access granted"). You can add an else block for what to do if the condition is false. For repetition, the for loop is invaluable. It iterates over a sequence (like a list). For example, for p in prompts: print(p) would print each prompt stored in our prompts list, one by one.
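Here is a small runnable example that combines a decision and a loop:

```python
user_age = 21
if user_age >= 18:          # the condition is checked once
    print("Access granted")
else:
    print("Access denied")

prompts = ["What is the capital of France?",
           "Summarize the theory of relativity."]
for p in prompts:           # the loop body runs once per list item
    print(p)
```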
As programs grow, repeating the same sequence of instructions becomes tedious and error-prone. Functions solve this by letting you package a block of code under a specific name. You define a function using def, give it a name, specify any inputs (parameters or arguments) it needs, and write the code it should execute. Later, you can call the function by its name whenever you need that task performed, potentially passing in different arguments each time. Functions can also return a result. For instance, you might define a function send_to_gemini(prompt_text) that takes a prompt, interacts with the API, and returns the AI’s response.
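As a sketch, such a function might look like the following; the body is a placeholder, since real API calls are covered in the next chapter:

```python
def send_to_gemini(prompt_text):
    """Send a prompt to the Gemini API and return the response text."""
    # Placeholder: the real API call is shown in the next chapter.
    return f"(pretend Gemini answered: {prompt_text})"

reply = send_to_gemini("What is the capital of France?")
print(reply)
```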
Python’s power is greatly extended by modules (sometimes called libraries or packages). These are collections of pre-written code, often providing specialized functionality. Your scripts interacting with Gemini will rely heavily on Google’s specific Gemini library. To use code from a module, you first need to bring it into your script using the import statement. For example, import google.generativeai as genai makes the functions and objects defined within the google.generativeai module available for use in your code, often giving it a shorter alias (genai in this case) for convenience.
So, how does this all fit together when interacting with something like the Gemini API? Typically, a script might start with import statements to bring in necessary libraries (like the Gemini library). It might then define variables to store things like your API key or configuration settings (perhaps using a dictionary). You might define one or more functions, for instance, one to format your prompt and another to handle the actual communication with the API, taking the prompt as input and returning the response. The main part of your script could then use a loop (maybe a for loop) to iterate through a list of prompts, calling your functions for each one, and perhaps using if statements to check the responses or handle potential errors.
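Putting these pieces together, a typical script skeleton might look like this sketch; the Gemini call itself is explained in the next chapter, and the model name is just an example:

```python
import google.generativeai as genai  # bring in the external library

config = {"model": "gemini-1.5-flash", "api_key": "YOUR_API_KEY"}  # settings

def send_to_gemini(prompt_text):
    """Send one prompt to Gemini and return the reply text."""
    genai.configure(api_key=config["api_key"])
    model = genai.GenerativeModel(config["model"])
    return model.generate_content(prompt_text).text

prompts = ["What is the capital of France?",
           "Summarize the theory of relativity."]
for p in prompts:
    answer = send_to_gemini(p)
    if answer:                      # simple check before printing
        print(p, "->", answer)
```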
When you look at the Python example scripts later in this book, don’t feel pressured to understand every single character immediately. Instead, focus on identifying these core building blocks: Where are variables being defined? What data (strings, lists, dictionaries) are they holding? Can you spot the loops (for) and decisions (if)? Which parts are reusable functions (def), and where are they being called? Which external libraries are being brought in (import)? Recognizing these patterns is the key to deciphering the script’s purpose and logic. Python’s readability is your ally here; often, the variable and function names themselves will give you strong clues about what the code is trying to achieve. Programming is fundamentally about breaking down problems into logical steps, and Python provides a clear syntax for expressing those steps.
Wrap Up for Introducing Python
Dear reader, this short chapter is not sufficient to introduce you to Python programming but it will enable you to understand the Python examples in the following chapters. Furthermore, the Gemini mobile apps or web interface can be used to guide you through the use of Python.
First Steps Using Python with the Gemini Python Library
Introduction
Welcome to your first practical steps in harnessing the power of Google’s Gemini models using Python. The Gemini API provides access to Google’s state-of-the-art large language models, enabling you to integrate generative AI capabilities into your applications. To facilitate this interaction within a Python environment, Google provides the `google-generativeai` library, often referred to as the Gemini Python SDK (Software Development Kit).
What is an SDK? An SDK is a collection of software development tools in one installable package. SDKs ease the creation of applications by providing compilers, debuggers, and often a software framework. In our context, the Gemini Python SDK simplifies interaction with the Gemini API by handling the complexities of HTTP requests, responses, authentication, and data formatting, allowing you to focus on what you want to achieve with the model rather than the low-level communication details.
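To make that abstraction concrete, here is a sketch of a raw call to the Gemini REST endpoint using the `requests` library, compared with the SDK equivalent. The URL and JSON shape follow the public REST documentation, but treat the details as illustrative; the SDK examples later in this chapter are the recommended path:

```python
import os
import requests  # pip install requests

# Raw REST call: you build the URL and payload and parse the JSON yourself.
url = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/gemini-1.5-flash:generateContent")
payload = {"contents": [{"parts": [{"text": "Say hello."}]}]}
resp = requests.post(url, json=payload,
                     params={"key": os.getenv("GOOGLE_API_KEY")})
print(resp.json())

# With the SDK, the same request is essentially:
#   model = genai.GenerativeModel("gemini-1.5-flash")
#   response = model.generate_content("Say hello.")
```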
This chapter focuses on the crucial first step: setting up your environment, installing the necessary library, and authenticating your requests to the Gemini API. Authentication verifies your identity and authorizes your application to use the API, often tying usage to your Google Cloud project for billing and quota management.
Prerequisites
Before you begin coding, ensure you have the following:
- Python Installed: You’ll need a compatible version of Python installed on your system (typically Python 3.9 or newer is recommended for modern libraries). You can check your version using `python --version` or `python3 --version` in your terminal.
- Google API Key: You need an API key associated with a Google Cloud project where the “Generative Language API” (or potentially newer related APIs like Vertex AI) is enabled.
  - Navigate to the Google Cloud Console.
  - Create a new project or select an existing one.
  - Ensure the necessary API (e.g., Generative Language API) is enabled for your project.
  - Go to the “Credentials” section and create an API key.
  - Important: Treat your API key like a password. Keep it secure and do not embed it directly in your source code or commit it to version control.
Installation and the requirements.txt File
To use the Gemini Python SDK, you first need to install it. Python projects commonly manage dependencies using a file named `requirements.txt`.
What is requirements.txt?
This file lists all the external Python packages that your project depends on, along with optional version specifiers. Using `requirements.txt` ensures that anyone working on the project (or any environment where the project is deployed) can install the exact same set of dependencies easily. This reproducibility is crucial for avoiding compatibility issues.
For basic Gemini API interaction, your `requirements.txt` file needs to contain at least the SDK library:

```
# requirements.txt
google-generativeai
```
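For stronger reproducibility you can pin an exact version with an `==` specifier. The version number below is only a placeholder; check PyPI or run `pip show google-generativeai` to see the version you actually installed:

```
# requirements.txt with a pinned version (version number is illustrative)
google-generativeai==0.8.3
```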
Installation Steps:
- Create a file named `requirements.txt` in your project directory.
- Add the line `google-generativeai` to this file.
- It’s highly recommended to use a Python virtual environment (`venv`) to isolate your project’s dependencies. Create and activate one:

```
python -m venv myenv          # Create the virtual environment (e.g., named 'myenv')
source myenv/bin/activate     # Activate on Linux/macOS
# or
.\myenv\Scripts\activate      # Activate on Windows
```

- Install the dependencies listed in your `requirements.txt` file using `pip`, Python’s package installer:

```
pip install -r requirements.txt
```

Pip will read the file and download/install the `google-generativeai` library and any packages it depends on.
Authentication: Connecting Securely
Authentication is how the Gemini API verifies that your request is legitimate and associated with your account. The primary method used by the SDK, especially for development and server-side applications, is an API key.
Best Practice: Secure API Key Management
As mentioned, never hardcode your API key directly into your source code. This is a significant security risk. If your code is ever exposed (e.g., committed to a public repository), your key could be stolen and used maliciously, potentially incurring costs on your account.
The recommended approach is to store your API key in an environment variable. An environment variable is a variable stored outside your program, within the operating system’s environment. Your code can then read this variable at runtime.
Setting the Environment Variable:

- Linux/macOS:

```
export GOOGLE_API_KEY='YOUR_API_KEY_HERE'
```

  (Add this line to your shell profile like `~/.bashrc` or `~/.zshrc` for persistence across sessions.)

- Windows (Command Prompt):

```
set GOOGLE_API_KEY=YOUR_API_KEY_HERE
```

- Windows (PowerShell):

```
$env:GOOGLE_API_KEY='YOUR_API_KEY_HERE'
```

- Using `.env` files: For project-specific variables, you can use a `.env` file in your project root and the `python-dotenv` library to load them. Install it (`pip install python-dotenv`) and load it at the start of your script, as shown in the sketch below.
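Here is a minimal sketch of the `.env` approach, assuming a `.env` file containing the line `GOOGLE_API_KEY=your-key-here` sits next to the script:

```python
# Load variables from a local .env file into the environment
from dotenv import load_dotenv  # pip install python-dotenv
import os

load_dotenv()  # reads .env from the current directory by default
api_key = os.getenv("GOOGLE_API_KEY")
print("Key found!" if api_key else "Key missing - check your .env file")
```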
Example 1: auth_test1.py - Basic Authentication & Setup
This script demonstrates the fundamental step of configuring the Gemini SDK with your API key retrieved from the environment.
```python
# --- Example 1: Basic Authentication & Setup ---
# Purpose: Configure the Gemini API client using an API key
# stored securely in an environment variable.

import google.generativeai as genai
import os

# Best practice: Store your API key in an environment variable
# (e.g., GOOGLE_API_KEY) rather than hardcoding it.
# You can set this in your system or using a .env file with python-dotenv.
try:
    # 1. Retrieve the API key from the environment variable
    api_key = os.getenv('GOOGLE_API_KEY')
    if not api_key:
        # Raise KeyError if the variable is not set or is empty
        raise KeyError("GOOGLE_API_KEY environment variable not found or is empty.")

    # 2. Configure the SDK with the retrieved API key
    genai.configure(api_key=api_key)

    # 3. Verify configuration by listing available models (optional but recommended)
    # This makes a simple API call to confirm authentication works.
    print("\nAvailable models supporting content generation:")
    for m in genai.list_models():
        # Filter for models that can actually generate text/content
        if 'generateContent' in m.supported_generation_methods:
            print(m.name)

# 4. Handle potential errors during configuration
except KeyError as e:
    # Specific error for missing environment variable
    print(f"Error: {e}")
    print("Please set the GOOGLE_API_KEY environment variable with your API key.")
except Exception as e:
    # Catch any other exceptions during the genai.configure() or list_models() calls
    print(f"An error occurred during configuration or model listing: {e}")
```
Explanation:

- Import necessary libraries: `google.generativeai` (the SDK) and `os` (to access environment variables).
- Retrieve API Key: `os.getenv('GOOGLE_API_KEY')` attempts to read the value of the environment variable named `GOOGLE_API_KEY`. We add a check to ensure it’s not `None` or empty.
- Configure SDK: `genai.configure(api_key=api_key)` is the core function call. It initializes the SDK globally within your script’s context, setting up the necessary credentials for subsequent API calls.
- Verify Configuration (Optional but useful): `genai.list_models()` makes a lightweight call to the API to retrieve a list of available models. Successfully getting this list back confirms that your API key is valid and the SDK is configured correctly. We filter this list to show only models capable of content generation (`'generateContent' in m.supported_generation_methods`).
- Error Handling: The `try...except` block gracefully handles potential issues: `KeyError` catches the specific error if the `GOOGLE_API_KEY` environment variable isn’t set; `Exception` catches other potential errors during the configuration or the `list_models` call (e.g., network issues, invalid API key).
Running the Script:

Save the code as `auth_test1.py`. Ensure your `GOOGLE_API_KEY` environment variable is set, then run the script from your terminal:

```
python auth_test1.py
```
If successful, you should see a list of model names printed to the console. If not, the error messages should guide you.
Making Your First API Call: Text Generation
Once the SDK is configured via `genai.configure()`, you can start using the models. Let’s look at a simple text generation example.
Example 2: text_generation.py
This script builds upon the authentication setup to send a prompt to a Gemini model and receive a generated response.
```python
# --- Example 2: Text Generation ---
# Purpose: Use the configured Gemini client to generate text
# based on a given prompt.

import google.generativeai as genai
import os

# Best practice: Store your API key in an environment variable
# (e.g., GOOGLE_API_KEY) rather than hardcoding it.
try:
    # 1. Configure the SDK (required in every script/session using the API)
    api_key = os.getenv('GOOGLE_API_KEY')
    if not api_key:
        raise KeyError("GOOGLE_API_KEY environment variable not found or is empty.")
    genai.configure(api_key=api_key)

    # 2. Select the model
    # Use a specific model name from the list obtained in Example 1.
    # 'gemini-1.5-flash' is often a good balance of speed and capability;
    # newer models (e.g., 'gemini-2.0-flash') may also be available.
    model = genai.GenerativeModel('gemini-1.5-flash')  # Or 'gemini-pro', etc.

    # 3. Define the prompt (your instruction to the model)
    prompt = ("Brainstorm 5 blog post ideas about remote work productivity "
              "for solo knowledge workers.")

    # 4. Generate content by calling the model
    # This makes the actual API request to the Gemini service.
    response = model.generate_content(prompt)

    # 5. Process and print the response
    print("\n--- Blog Post Ideas ---")
    # Check if the response contains parts (text content).
    # Responses might be empty if blocked due to safety settings or other issues.
    if response.parts:
        print(response.text)  # Access the generated text
    else:
        # Provide feedback if the response was empty or blocked
        print("Response was empty or potentially blocked.")
        # response.prompt_feedback often contains reasons for blocking
        # (e.g., safety ratings)
        print(f"Safety feedback: {response.prompt_feedback}")

# 6. Handle potential errors during generation
except KeyError as e:
    # Handle missing API key specifically
    print(f"Error: {e}")
    print("Please set the GOOGLE_API_KEY environment variable.")
except AttributeError:
    # This might occur if genai.configure was not called successfully before model usage
    print("Error: Gemini API likely not configured. Ensure GOOGLE_API_KEY is set and valid.")
except Exception as e:
    # Catch other errors during model instantiation or generation
    print(f"An error occurred during text generation: {e}")
    # This could include API errors, network issues, invalid model names, etc.
```
Explanation:

- Configure SDK: Notice that `genai.configure()` must be called again here. Configuration typically needs to happen once per script execution or application session that intends to use the API.
- Select Model: `genai.GenerativeModel('model-name')` creates an instance of the model you want to interact with. You should use one of the model names listed by the `auth_test1.py` script (e.g., `gemini-1.5-flash`, `gemini-pro`).
- Define Prompt: This is the input text you provide to the model.
- Generate Content: `model.generate_content(prompt)` sends the prompt to the specified Gemini model via the API and waits for the response.
- Process Response: The result is a `GenerateContentResponse` object. `response.text` provides the primary generated text content if available. It’s crucial to check whether the response contains content (`if response.parts:`); sometimes a response might be blocked due to safety filters or other reasons, and `response.prompt_feedback` can provide information about why content might have been blocked (e.g., safety ratings).
- Error Handling: Includes checks for `KeyError` (missing API key), `AttributeError` (if `genai` wasn’t configured properly before trying to use `GenerativeModel`), and general `Exception`s for API or network issues during the generation call.
Running the Script:

Save the code as `text_generation.py`. Ensure your `GOOGLE_API_KEY` is set, then run:

```
python text_generation.py
```
You should see the brainstormed blog post ideas printed to your console.
Wrap Up
You have now successfully set up your Python environment, installed the Gemini SDK using `pip` and `requirements.txt`, and learned the fundamental process of authenticating with the Gemini API using an API key stored securely in an environment variable. You’ve also run your first text generation query.
Authentication is the gateway to using the API. By following best practices like using environment variables, you ensure your credentials remain secure while enabling your Python applications to leverage the power of Google’s Gemini models. In the following chapters, we will explore more advanced features of the API and the Python SDK.
Gemini Thinking Mode
Introduction
In the previous chapter, we covered the fundamentals of authenticating with the Gemini API and performing basic text generation using the Python SDK. Those examples focused on getting a direct response to a prompt. However, some tasks benefit from the model having more allocated time or computational resources to “think” or plan before generating the final output. This is where Gemini Thinking Mode comes into play.
Thinking Mode is an advanced feature designed for complex prompts that might involve multi-step reasoning, intricate analysis, or detailed planning. By enabling Thinking Mode, you instruct the API to dedicate a specific budget of internal processing time or compute units towards understanding and structuring its response before it begins generating the text you see. This can lead to higher-quality, more coherent, and better-reasoned outputs for challenging tasks.
When to Use Thinking Mode
While standard generation is fast and efficient for many prompts, Thinking Mode is particularly beneficial when:
- Complex Problem Solving: The prompt requires breaking down a problem into steps, evaluating different approaches, or synthesizing information from various angles (e.g., “Develop a phased marketing strategy for launching a new SaaS product targeting small businesses”).
- Detailed Analysis or Summarization: You need the model to deeply analyze a complex piece of text or data and provide a nuanced summary or critique.
- Creative Tasks Requiring Planning: Generating elaborate stories, complex code structures, or detailed project plans where upfront structuring improves the final result.
- Multi-Step Instructions: The prompt contains several dependent instructions that the model needs to process sequentially or holistically before responding.
For simple prompts (e.g., “What is the capital of France?”), Thinking Mode is likely unnecessary and might only add latency without significantly improving the output. It’s a tool best reserved for computationally intensive generative tasks.
Enabling Thinking Mode with the Python SDK
Activating Thinking Mode involves configuring specific parameters within the `generate_content` method call, using helper objects from the `google.genai.types` module.

- `types.GenerateContentConfig`: This object acts as a container for various advanced generation settings, including safety settings, stop sequences, and, crucially for us, the thinking configuration.
- `types.ThinkingConfig`: This object specifically controls the Thinking Mode feature.
- `thinking_budget`: This parameter within `ThinkingConfig` specifies the amount of computational resources or time allocated for the model’s “thinking” phase. The exact unit is an internal measure, but a higher value generally allows the model more pre-computation time, potentially leading to better results on complex prompts, possibly at the cost of increased latency. You may need to experiment with this value based on the complexity of your prompts and desired output quality.
Alternative Client Initialization:

Note that the example below uses `genai.Client(api_key=...)` to initialize the connection. This is an alternative to the `genai.configure(api_key=...)` method used in the previous chapter: `genai.configure()` sets up a default global client, while `genai.Client()` creates an explicit client instance. An explicit client can be useful if you need to manage multiple clients with different settings or prefer explicit object management over global configuration. Both methods achieve the goal of authenticating your requests using the API key. Note also that the import differs: this chapter’s examples use `from google import genai` (the newer `google-genai` package) rather than `import google.generativeai as genai`.
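A side-by-side sketch of the two initialization styles, assuming both the `google-generativeai` and `google-genai` packages are installed and `GOOGLE_API_KEY` is set in the environment:

```python
import os

# Style 1: global configuration with the google-generativeai package
import google.generativeai as genai_legacy
genai_legacy.configure(api_key=os.getenv("GOOGLE_API_KEY"))
model = genai_legacy.GenerativeModel("gemini-1.5-flash")

# Style 2: explicit client instance with the newer google-genai package
from google import genai
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))
# calls then go through the client, e.g. client.models.generate_content(...)
```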
Example: demo_1.py - Advertising Strategy Brainstorm with Thinking Mode
This script demonstrates how to request advertising ideas for a new consulting company, specifically enabling Thinking Mode to potentially generate more structured or well-reasoned options.
```python
from google import genai
from google.genai import types
import os

# Create an explicit client, authenticating with the API key from the environment
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

MODEL_ID = "gemini-2.5-flash-preview-04-17"

prompt = """
I am starting an AI/LLM consulting company. Suggest
three different options for advertising services.
"""

response = client.models.generate_content(
    model=MODEL_ID,
    contents=prompt,
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=500
        )
    )
)

print(response.text)
```
Explanation:

- Imports: Includes `google.genai`, `google.genai.types` (specifically for `GenerateContentConfig` and `ThinkingConfig`), and `os`.
- Client Initialization: `genai.Client()` creates the client instance, authenticating using the `GOOGLE_API_KEY` environment variable.
- Model ID: Specifies the model to use (`gemini-2.5-flash-preview-04-17`). Note that using preview models means behavior might change in the future.
- Prompt: A moderately complex prompt asking for structured suggestions, suitable for Thinking Mode.
- `generate_content` Call: `model` and `contents` are specified as usual. The `config=types.GenerateContentConfig(...)` argument passes advanced generation settings; inside it, `thinking_config=types.ThinkingConfig(...)` creates a `ThinkingConfig` instance, and `thinking_budget=500` assigns the budget for the thinking phase. The value `500` is relative; experimentation might be needed to find optimal values for different tasks.
- Output: The script prints `response.text`, which contains the model’s generated advertising options.
This example generates a report that is several pages in length; here is the beginning of the output:
```
Okay, starting an AI/LLM consulting company is exciting! Since it's a B2B service focused on technology and strategy, your advertising needs to reach business decision-makers and demonstrate expertise. Here are three distinct options for advertising your services:

**Targeted Content Marketing & LinkedIn Advertising:**

**Strategy:** Position yourself as a thought leader and educator in the AI/LLM space. Create valuable content that addresses the pain points and opportunities businesses face with AI (e.g., improving customer service with chatbots, automating tasks, leveraging data with LLMs, ethical considerations, choosing the right models).

**Tactics:**
- **Blog Posts/Articles:** Write detailed articles on your website covering specific use cases, implementation strategies, benefits, and challenges.
- **Whitepapers/Ebooks:** Offer downloadable, in-depth guides on topics like "The Executive's Guide to Implementing AI in [Industry]" or "Evaluating LLM Solutions for Your Business."
- **Webinars/Online Workshops:** Host free sessions demonstrating how AI/LLMs can solve specific business problems. This allows direct interaction and showcases expertise.
- **Case Studies:** Showcase successful projects (anonymized if necessary) highlighting the business results you achieved for clients.
- **LinkedIn:** The primary platform for B2B networking and advertising.
  - Share your content organically.
  - Run targeted LinkedIn Ads based on job titles (CEO, CTO, CIO, Head of Innovation, Department Heads), industry, company size, and even specific companies. Focus ad copy on solving business problems rather than just listing services.
...
```
Run the example to see a full report.
Observing the Difference
While this example doesn’t run a side-by-side comparison, if you were to run the same complex prompt with and without Thinking Mode (and an appropriate `thinking_budget`), you might observe:
- With Thinking Mode: The output may be more structured, coherent, address all parts of the prompt more effectively, or show deeper reasoning.
- Without Thinking Mode: The output might still be good, but potentially less organized or might miss some nuances for very complex requests.
- Latency: The request with Thinking Mode enabled will take longer to return a response due to the dedicated pre-computation phase.
The key is that Thinking Mode provides an opportunity for the model to improve its response quality on complex tasks by allocating specific resources before generation starts.
Wrap Up
Gemini Thinking Mode offers a powerful way to enhance the quality of responses for complex prompts by allowing the model dedicated time and resources for planning and reasoning. By using `types.GenerateContentConfig` and `types.ThinkingConfig` with an appropriate `thinking_budget` in your `generate_content` calls via the Python SDK, you can leverage this feature for tasks demanding deeper analysis, structuring, or problem-solving. Remember to experiment with the `thinking_budget` and reserve this feature for prompts where the added pre-computation phase is likely to yield significant benefits.
Using Gemini For Both Images and Text (Multimodal Capabilities)
Introduction
In the previous chapters, we explored how to interact with Gemini models using text-based prompts, including leveraging Thinking Mode for complex reasoning tasks. However, one of the significant advancements in modern AI is multimodality – the ability of models to process and understand information from multiple types of input simultaneously. Several Gemini models excel at this, allowing you to combine text prompts with images (and potentially other data types like audio or video in the future) in a single request.
This chapter focuses on demonstrating how to use the Gemini Python SDK to send both an image and a text prompt to a capable Gemini model. We will ask the model to analyze the content of the image based on instructions provided in the text prompt. This opens up powerful possibilities for visual question answering, image description, object recognition, and more.
Prerequisites and Setup
- Multimodal Model: Ensure the `MODEL_ID` you use corresponds to a Gemini model version that supports image input (e.g., `gemini-1.5-pro`, `gemini-1.5-flash`, or specific preview versions like the one in the example).
- Python Environment: You need your Python environment set up as discussed in Chapter 2.
- Updated Dependencies: Image processing requires an additional library, Pillow (a fork of PIL, the Python Imaging Library). Your `requirements.txt` file should be updated to include it:

```
# requirements.txt
google-genai>=1.10.0
Pillow
```

  - `google-genai>=1.10.0`: Specifies that we need version 1.10.0 or newer of the Gemini SDK (the `google-genai` package provides the `from google import genai` interface used in this chapter’s example). Version requirements can be important for accessing the latest features or ensuring compatibility.
  - `Pillow`: This library is essential for opening, manipulating, and preparing image files in Python so they can be sent to the model.

Remember to install or update your dependencies using:

```
pip install -r requirements.txt
```
Combining Text and Images in Prompts
The key to sending multimodal input using the Gemini Python SDK lies in how you structure the `contents` argument passed to the `generate_content` method. Instead of passing a single string (like we did for text-only prompts), you pass a list where each element represents a different part of the input.

For a text-and-image prompt, this list will typically contain:

- The text prompt (as a string).
- The image data (often loaded as a `PIL.Image` object).
The SDK handles the necessary encoding and formatting to send both the text instructions and the image pixels to the Gemini API endpoint.
Example: photo_understanding.py - Analyzing an Image

This script demonstrates loading a local image file and sending it along with two different text prompts to a multimodal Gemini model to analyze the people depicted.

(Please ensure you have an image file located at `../data/poker.jpeg` relative to where you run this script, or modify the `image_path` accordingly.)
```python
# --- Example: Multimodal Input (Image and Text) ---
# Purpose: Send an image and text prompts to Gemini
# to analyze the image content.

from google import genai
from google.genai import types
from PIL import Image  # Used for opening the image file

import os  # Used for building the image path and reading the API key

# --- Configuration ---
try:
    # 1. Check for the API key, then initialize the client
    if not os.getenv("GOOGLE_API_KEY"):
        raise ValueError("GOOGLE_API_KEY environment variable not set.")
    client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

    # 2. Select a Multimodal Model ID
    MODEL_ID = "gemini-2.5-flash-preview-04-17"

    # --- First Analysis: General Description and Bounding Boxes ---

    # 3. Define the First Text Prompt
    prompt = """
    Return bounding boxes around people as a JSON array with labels.
    Never return masks or code fencing. Limit to 10 people.
    Describe each person identified in a picture.
    """

    # 4. Load the Image using Pillow
    # Assumes the image is in ../data relative to the script's CWD
    image_path = os.path.join(os.getcwd(), "..", "data", "poker.jpeg")
    im = Image.open(image_path)

    # 5. Make the First API Call (Image + First Prompt)
    response = client.models.generate_content(
        model=MODEL_ID,
        # Key: Pass prompt and image object together in a list
        contents=[prompt, im],
        # Configuration: Thinking budget set to 0 (minimal/default pre-computation)
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(
                thinking_budget=0
            )
        )
    )

    # 6. Print the First Response
    if response.parts:
        print(response.text)
    else:
        print("Response 1 was empty or blocked.")
        try:
            print(f"Safety feedback: {response.prompt_feedback}")
        except AttributeError:
            pass  # Ignore if feedback isn't available

    # --- Second Analysis: Focus on Hands ---

    # 7. Define the Second, More Specific Text Prompt
    prompt2 = """
    Return bounding boxes around people as a JSON array with labels.
    Never return masks or code fencing. Limit to 10 people.
    Describe each person identified in a picture, specifically what
    they are holding in their hands.
    """

    # 8. Make the Second API Call (Image + Second Prompt)
    # Re-uses the loaded image 'im'
    response = client.models.generate_content(
        model=MODEL_ID,
        # Key: Pass the *new* prompt and the *same* image object
        contents=[prompt2, im],
        # Configuration: Same thinking budget setting
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(
                thinking_budget=0
            )
        )
    )

    # 9. Print the Second Response
    if response.parts:
        print(response.text)
    else:
        print("Response 2 was empty or blocked.")
        try:
            print(f"Safety feedback: {response.prompt_feedback}")
        except AttributeError:
            pass  # Ignore if feedback isn't available

except FileNotFoundError:
    print(f"Error: Image file not found at expected path: {image_path}")
    print("Please ensure the image exists at '../data/poker.jpeg' relative to "
          "your script execution directory.")
except ValueError as e:
    print(f"Configuration Error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
```
Explanation:

- Imports: We import `PIL.Image` for image handling, alongside the Gemini libraries and `os`.
- Client & Model: Initialization uses `genai.Client` and selects the specified multimodal preview model.
- Prompt 1: Defines the initial analysis task: identify people, provide bounding boxes in JSON format (with specific negative constraints like “Never return masks or code fencing”), limit the count, and provide descriptions.
- Image Loading: The script constructs the path to the image file (`../data/poker.jpeg`) relative to the current working directory and uses `Image.open()` from the Pillow library to load it into the `im` variable. Error handling for `FileNotFoundError` is added for robustness.
- API Call 1: This is the core of the multimodal request. `contents=[prompt, im]` passes the text prompt (`prompt`) and the loaded Pillow Image object (`im`) together as a list. In the `config`, Thinking Mode is set with `thinking_budget=0`, requesting minimal or default pre-computation for this task, prioritizing speed or assuming the task doesn’t require extended reasoning time.
- Response 1: The generated `response.text` should contain the JSON array of bounding boxes and the general descriptions requested.
- Prompt 2: A new prompt is defined, refining the request to focus specifically on what each person is holding.
- API Call 2: A second call is made using the same loaded image (`im`) but a new text prompt (`prompt2`). The structure is identical, demonstrating how you can ask different questions about the same visual input.
- Response 2: The output text should now contain the bounding boxes and descriptions focused on the objects held by the people in the image.
Interpreting the Output
Here is the output:

```
[
  {"box_2d": [198, 438, 394, 628], "label": "people"},
  {"box_2d": [175, 592, 988, 1000], "label": "people"},
  {"box_2d": [257, 734, 530, 876], "label": "people"},
  {"box_2d": [203, 146, 473, 346], "label": "people"},
  {"box_2d": [253, 0, 996, 324], "label": "people"}
]
```

```
A group of five people are playing poker at a long table. The woman at the top of the table has short white hair and is wearing a red scarf. The man to her left has short gray hair and is smiling. To his left is a man wearing a black baseball cap and a gray sweatshirt. To the right of the woman with white hair is another woman with short brown hair, and to her right is a man with a gray beard and a green shirt.
```

```
[
  {"box_2d": [198, 437, 387, 627], "label": "a person holding cards"},
  {"box_2d": [254, 0, 1000, 326], "label": "a person holding cards"},
  {"box_2d": [177, 593, 996, 1000], "label": "a person holding cards"},
  {"box_2d": [205, 155, 467, 351], "label": "a person"},
  {"box_2d": [257, 734, 464, 877], "label": "a person"}
]
```
The `response.text` from these calls will contain the model’s analysis based on both the text instructions and the visual information in `poker.jpeg`. You would expect to see:

- A JSON formatted string representing the bounding boxes for detected people.
- Text descriptions corresponding to each identified person, tailored to the specifics of `prompt` or `prompt2`.
Parsing the JSON and associating it with the descriptions would be the next step in a real application.
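As a minimal sketch of that next step, assuming the model returned a clean JSON array like the one shown above (the `box_2d` values appear to be normalized to a 0-1000 grid in ymin, xmin, ymax, xmax order, matching the sample output):

```python
import json

# Sample model output from the example above (in practice, use response.text)
raw = '''[
  {"box_2d": [198, 437, 387, 627], "label": "a person holding cards"},
  {"box_2d": [205, 155, 467, 351], "label": "a person"}
]'''

boxes = json.loads(raw)
for b in boxes:
    # Unpack the box coordinates (assumed ymin, xmin, ymax, xmax on a 0-1000 grid)
    ymin, xmin, ymax, xmax = b["box_2d"]
    print(f'{b["label"]}: top-left=({xmin}, {ymin}), bottom-right=({xmax}, {ymax})')
```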
Potential Applications
This ability to process images and text together enables many applications, including:
- Visual Question Answering (VQA): Ask specific questions about an image (“What color is the car?”, “Are there any animals in this picture?”).
- Image Captioning/Description: Generate detailed textual descriptions of an image’s content.
- Object Recognition & Analysis: Identify objects and describe their attributes or relationships (“Describe the furniture in the room.”).
- Data Extraction from Images: Pull text or structured data from photos of documents, whiteboards, or signs.
- Image Comparison: Provide two images and ask the model to describe the differences.
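For instance, the image-comparison idea from the list above only requires passing two images in the `contents` list. A minimal sketch, assuming two local files `before.jpeg` and `after.jpeg` (hypothetical paths) and a multimodal-capable model:

```python
from google import genai
from PIL import Image
import os

client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

# Hypothetical image paths; substitute your own files
img_before = Image.open("before.jpeg")
img_after = Image.open("after.jpeg")

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",  # any multimodal-capable model
    # Text plus multiple images can be mixed in one contents list
    contents=["Describe the differences between these two images.",
              img_before, img_after],
)
print(response.text)
```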
Wrap Up
Gemini’s multimodal capabilities significantly broaden the scope of tasks you can accomplish. By leveraging the Python SDK and libraries like Pillow, you can easily combine text prompts and image data in a single API call. The key is structuring the `contents` parameter as a list containing both your textual instructions and the loaded image object. This allows you to build sophisticated applications that can “see” and reason about visual information based on your specific textual guidance. Remember to use a model that supports multimodal input and include `Pillow` in your project dependencies.
Using Gemini to Write Python for Google Colab
I have made available on the web, at https://markwatson.com/data/wisconsin_with_header.csv, a sample CSV file for cancer research. Download this file to your laptop; later, we will use Colab’s “Analyze files with Gemini” feature with this file as an example.
Prompt:
```
Analyze this cancer data spreadsheet file and describe the columns.

Write a Python script that uses IPython to fetch this CSV file and make a bar plot of it using matplotlib. The code should work in Google Colab.
```
I will create a new empty Colab notebook and work through an example. Here is the link to the saved notebook:

https://colab.research.google.com/drive/17EQ0TUCrHg3wWduBWj9KMJPc-9Lp669R?usp=sharing
The empty notebook has the button Analyze files with Gemini:

In the next screenshot I uploaded the sample CSV file and used the prompt listed above:

In the following screenshot, I have closed the Gemini window on the right side of the Colab notebook and show example output:

If you look at the linked Colab notebook you will notice that some of the generated analysis code terminated due to runtime errors and Gemini fixed these errors.
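For reference, the kind of script Gemini generates for this prompt typically resembles the following sketch. This is not the notebook’s exact code, and the column name `diagnosis` is an assumption about the CSV header, so adjust it to match the actual file:

```python
# A sketch of a Colab-style analysis script (column name 'diagnosis' is assumed)
import pandas as pd
import matplotlib.pyplot as plt

# Fetch the CSV directly from the web; pandas can read from a URL
url = "https://markwatson.com/data/wisconsin_with_header.csv"
df = pd.read_csv(url)

print(df.columns.tolist())  # inspect the actual column names first

# Bar plot of the class distribution (replace 'diagnosis' with a real column)
df["diagnosis"].value_counts().plot(kind="bar")
plt.title("Class distribution in the Wisconsin cancer dataset")
plt.xlabel("diagnosis")
plt.ylabel("count")
plt.show()
```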
Wrap Up
I use Google Colab notebooks frequently in my work. The example in this chapter demonstrates having Gemini write Python code for you.
Book Wrap Up
Dear reader, I hope that you have found this book useful and fun to read. Please be advised that this book is not “done” in the traditional sense: on average, I update my LeanPub books four or five times in the first year after publication, and less often thereafter. I have updated my Common Lisp book over thirty times with new material since I published it a decade ago. My message to you is this: if you are reading this book for free online, please revisit the book link periodically for new material. If you bought this book and asked for notifications of new versions, then you will receive an email whenever I produce a new edition with a free download link.
Best regards,
Mark Watson