Ollama API Programming: Build Custom AI Apps & Scripts

Have you used Ollama to chat with powerful AI models right on your computer? It’s amazing for quick questions or creative ideas. But what if you want to build AI directly into your own tools or scripts? That’s where the real power for developers lies. Ollama offers a robust API. This API lets you programmatically interact with local AI models from your code. You can move beyond simple chat interfaces. You can unlock automation, integration, and complex custom logic. This article will introduce you to Ollama API programming. We will show basic examples. This will inspire you to build your own AI projects.

Prerequisites

Before you start coding with the Ollama API, you need a few things. First, you must have Ollama installed and running on your machine. This is essential as your code will talk to the running Ollama service. If you haven’t installed it yet, install Ollama now.

Next, you need basic programming knowledge. We assume you know Python for the examples here. Python is a popular language for scripting and AI tasks. Finally, the easiest way to interact with Ollama from Python is using the official ollama Python library. We have a separate guide that walks through installing this library and making your first basic API request. Please refer to that guide for detailed setup steps.

Recap: Ollama API Fundamentals

Ollama runs a local web server on your computer. This server provides a REST API. Your programs can send requests to this API. The default address for this API is http://localhost:11434.

There are several endpoints you can use. The most important for programming are /api/generate and /api/tags. The /api/generate endpoint handles text generation. The /api/tags endpoint gives you information about your installed models. We cover the basics of sending requests and understanding responses in our basic API tutorial. Now let’s see why you would use this API for programming tasks.

Why Program with the Ollama API? Use Cases for Developers

Using the Ollama API in your code offers significant advantages over just using the chat interface. You gain control and flexibility. Here are some key reasons why developers program with the Ollama API.

Automation

You can automate tasks that involve AI models. Imagine processing a folder of text files. You could summarize each file automatically. Or you could extract specific information from many documents. The API lets your script send text to the model and receive the output without manual steps. This saves time and effort for repetitive tasks.

Building Custom Interfaces

You are not limited to the standard Ollama interface. You can build your own application. This could be a web app, a desktop tool, or even a mobile app. Your custom interface can send requests to the local Ollama API running in the background. This allows you to design user experiences tailored to specific needs.

Integration into Existing Workflows

The Ollama API integrates AI into your current tools. Do you have scripts that process data? Add a step that uses an AI model to analyze or transform text. Connect AI capabilities to your existing software. This makes your workflows smarter without sending sensitive data to external services. Integrate Ollama AI easily.

Complex Logic

AI models can be just one part of a larger program. Your code can use the AI’s response in a more complex way. Maybe the AI generates text. Your program then analyzes that text. Or maybe the AI helps decide the next action for your script. You can build sophisticated applications where the AI plays a specific role within broader logic.

Key Programming Capabilities with the Ollama API

The Ollama API gives you programmatic access to various model functions. The ollama Python library simplifies using these functions. Let’s look at some core capabilities you can program with.

Generating Text with Control (/api/generate)

The /api/generate endpoint is central for creating text. When you use the API, you send a JSON payload. This payload includes the model name and the prompt. You can also include many other parameters. These parameters give you fine-grained control over the generation process. You can control creativity, length, and more.

Here is a Python example. It uses the ollama library to generate text, shows how to set parameters like temperature and num_predict, and demonstrates prompt engineering within the API call.

import ollama

client = ollama.Client()
try:
    response = client.generate(
        model='mistral', # Or another model you have installed
        prompt='Act as a creative writer. Write a short, humorous limerick about a rubber duck.',
        options={
            'temperature': 0.8, # Make it more creative (0.0 to 1.0)
            'num_predict': 60 # Limit length to approx 60 tokens
        }
    )
    print("Limerick:\n", response.get("response"))
except ollama.ResponseError as e:
     print(f"Ollama API Error: {e}")
     print("Ensure model is downloaded and Ollama is running.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

In this code, we create a client object. We then call the generate method. We pass the model name and the prompt. The options dictionary lets us pass specific parameters. temperature controls randomness. Higher values mean more creative output. num_predict sets a limit on the number of tokens generated. The API returns a dictionary. We access the generated text using response.get("response"). Error handling is important for robust code.

Handling Streaming Responses (/api/generate with stream: True)

When you use the chat interface, text appears as it’s generated. This is called streaming. The Ollama REST API streams responses by default, but the ollama Python library returns the complete response in one piece unless you pass stream=True. Streaming is great for interactive applications. It makes the user experience feel faster. Your program receives parts of the response as they are ready.

Handling a streaming response in code means processing data as it arrives. You typically loop through the response. Here is how you do it with the ollama library.

import ollama
client = ollama.Client()

# The Python library defaults to stream=False, so request streaming explicitly
try:
    response_stream = client.generate(
        model='llama2', # Use a model you have
        prompt='Tell me a story about a cat who learned to fly.',
        stream=True
    )
    print("Story (Streaming):")
    for part in response_stream:
        # Each 'part' is a dictionary containing a piece of the response
        if 'response' in part:
             # Print the text part without a newline, flush buffer immediately
             print(part['response'], end='', flush=True)
    print("\nEnd of story.") # Add a final newline after the stream finishes
except ollama.ResponseError as e:
     print(f"Ollama API Error: {e}")
     print("Ensure model is downloaded and Ollama is running.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

This code calls client.generate with stream=True. It gets back an iterator object, response_stream. We loop through this object. Each part in the loop is a dictionary. It contains a small piece of the generated text in the 'response' key. We print each part immediately. end='' prevents adding a newline after each part. flush=True makes sure the text appears right away. This gives the streaming effect.

Programmatically Managing Models (/api/tags, /api/pull, /api/delete)

The Ollama API also lets you manage your models programmatically. The /api/tags endpoint is used to list the models you have installed. This is useful if your application needs to know which models are available locally. The ollama library provides a simple method for this.

Here is a Python example to list your available models.

import ollama
client = ollama.Client()
try:
    models = client.list() # Uses the /api/tags endpoint
    print("Available Models:")
    if models and "models" in models:
        for model in models["models"]:
            # Newer versions of the ollama library return the name under the 'model' key,
            # older versions under 'name'; handle both
            name = model.get('name') or model.get('model')
            # Format size from bytes to GB for readability
            size_gb = round(model['size'] / (1024*1024*1024), 2)
            print(f"- {name} (Size: {size_gb} GB)")
    else:
        print("No models found or unexpected response format.")
except ollama.ResponseError as e:
    print(f"Ollama API Error: {e}")
    print("Ensure Ollama is running.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

This script calls client.list(). It gets a dictionary back containing model information under the "models" key. We loop through this list and print each model’s name and size. You can also use client.pull('model_name') to download a model programmatically or client.delete('model_name') to remove one. These methods allow your application to manage models without manual user interaction.
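
If your application pulls or removes models on the user’s behalf, a minimal sketch might look like the following. The 'llama2' tag here is only a placeholder; substitute any model from the Ollama library.

import ollama

client = ollama.Client()
try:
    # Download a model by name (placeholder tag; pick any model you actually want)
    client.pull('llama2')
    print("Model pulled.")
    # Remove a model you no longer need
    client.delete('llama2')
    print("Model deleted.")
except ollama.ResponseError as e:
    print(f"Ollama API Error: {e}")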

The Embeddings API (/api/embeddings)

Beyond generating text, AI models can create embeddings. Embeddings are numerical representations of text. They capture the meaning and context of words or sentences. The /api/embeddings endpoint provides access to this capability.

Embeddings are crucial for tasks like semantic search. They are also used in clustering similar texts. Retrieval Augmented Generation (RAG) systems heavily rely on embeddings. While using embeddings is an advanced topic, knowing the API exists is important.

Here is a very simple example showing how to get an embedding for a sentence. You will need an embeddings model installed, like nomic-embed-text.

import ollama

client = ollama.Client()
try:
    embedding_response = client.embeddings(
        model='nomic-embed-text', # Requires an embeddings model
        prompt='This is a test sentence.'
    )
    # The embedding is a list of numbers. Print the first few.
    print("Embedding (first 5 dimensions):", embedding_response['embedding'][:5])
except ollama.ResponseError as e:
     print(f"Ollama API Error: {e}")
     print("Ensure 'nomic-embed-text' or similar embeddings model is downloaded.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

This code calls client.embeddings(). It sends the model name and the text prompt. The response contains the computed embedding. The embedding is a list of numbers. This numerical vector represents the meaning of the text. Programming with embeddings opens up many possibilities for understanding and working with text data.
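
To give a feel for what these vectors enable, here is a minimal sketch of semantic comparison. It assumes the nomic-embed-text model from the example above; the embed and cosine_similarity helpers are illustrative, not part of the ollama library.

import math
import ollama

client = ollama.Client()

def embed(text):
    # Fetch the embedding vector for a piece of text
    return client.embeddings(model='nomic-embed-text', prompt=text)['embedding']

def cosine_similarity(a, b):
    # Higher values mean the two texts are closer in meaning
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = embed('The cat sat on the mat.')
v2 = embed('A kitten is resting on the rug.')
print("Similarity:", round(cosine_similarity(v1, v2), 3))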

Ideas for Your Own Projects

Now that you have seen some Ollama API programming examples, what can you build? The possibilities are vast. Start small and expand your ideas. Here are some project ideas to get you started:

Create a command-line tool. It could summarize text files or code snippets you pass to it (a minimal sketch follows after this list).

Build a simple desktop assistant. It could integrate local AI responses into your workflow.

Develop a basic web interface. This allows others on your local network to use your local LLM easily.

Write scripts for processing large text datasets offline. Use AI for analysis or transformation.

Generate variations of marketing copy or creative text programmatically.

Build data extraction scripts. Use the AI to pull specific information from unstructured text.
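
To make the first idea concrete, here is a minimal sketch of a command-line summarizer. The model name, prompt wording, and script name are assumptions; adapt them to the models you have installed.

import sys
import ollama

def summarize_file(path, model='mistral'):
    # Read the file and ask the model for a short summary
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()
    response = ollama.Client().generate(
        model=model,
        prompt=f"Summarize the following text in three sentences:\n\n{text}"
    )
    return response.get("response")

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: python summarize.py <file> [more files...]")
        sys.exit(1)
    for path in sys.argv[1:]:
        print(f"--- {path} ---")
        print(summarize_file(path))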

Troubleshooting API Programming Issues

Sometimes things don’t work as expected when programming with APIs. If you run into issues with the Ollama API, first check the general Ollama Troubleshooting guide.

For API-specific problems, verify a few things. Ensure the Ollama service is running in the background. Check that the API port (default 11434) is not blocked or used by another program. Double-check your code for correct JSON payload syntax. Look at the error codes returned by the API response. Make sure the model name you specify in your code is downloaded and available.
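
If you are unsure whether your code can reach the service at all, a quick connectivity check like the sketch below (assuming the default address) can rule that out before you debug prompts or payloads.

import ollama

try:
    # A successful list() call confirms the service is reachable on the default port
    client = ollama.Client(host='http://localhost:11434')
    client.list()
    print("Ollama service is reachable.")
except Exception as e:
    print(f"Could not reach Ollama at localhost:11434: {e}")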

FAQs

Q: What programming languages can I use with the Ollama API?

A: Any language that can send HTTP requests can use the Ollama API. Python is common due to the available client library, but you could use JavaScript, Go, Java, Ruby, etc., by sending standard HTTP requests.
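
As an illustration that no special library is required, here is a sketch that calls the REST endpoint directly with Python's requests package; any language with an HTTP client can send the same JSON.

import requests

payload = {
    "model": "mistral",  # any model you have installed
    "prompt": "Say hello in one sentence.",
    "stream": False  # ask for a single JSON response instead of a stream
}
r = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
r.raise_for_status()
print(r.json()["response"])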

Q: Can I expose the local Ollama API to the internet?

A: It is generally not recommended to expose your local Ollama API directly to the public internet without security measures. It’s designed for local use. If you need remote access, consider building a secure backend service that interacts with the local API.

Q: How do I handle long conversations or chat history via the API?

A: The /api/generate endpoint is stateless for single requests. For conversational flows, use the /api/chat endpoint (client.chat() in the Python library) and manage the history in your application code: you pass the previous user and assistant messages back with each request to maintain context, as the sketch below shows.
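
Here is a minimal sketch of that pattern using client.chat() and a messages list; the model name and questions are placeholders.

import ollama

client = ollama.Client()
history = []

def ask(question, model='llama2'):
    # Append the user message, send the full history, then store the reply
    history.append({'role': 'user', 'content': question})
    response = client.chat(model=model, messages=history)
    reply = response['message']['content']
    history.append({'role': 'assistant', 'content': reply})
    return reply

print(ask("Name one famous scientist."))
print(ask("What were they best known for?"))  # context carried over from the first turn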

Q: What’s the best model for coding via API?

A: The “best” model depends on your task and hardware. For coding tasks, models like Code Llama, deepseek-coder, or Phind-CodeLlama are popular choices. Experiment with different model types available in Ollama to find what works best for your specific programming needs.

Conclusion

Using the Ollama API unlocks the true potential of local AI models for developers. You move beyond simple chat and gain the power to integrate AI directly into your software. You can automate tasks, build custom applications, and add intelligent features to existing tools. The ollama Python library makes interacting with the API straightforward. We have shown examples for generating text with control, handling streaming, managing models, and using embeddings. Start experimenting with these code examples. Begin building your own local AI-powered projects today. The world of custom AI applications is now open to you.