Overview of LlamaIndex

The popular LlamaIndex project was originally called GPT-Index but has since been generalized to work with many LLMs, including GPT-4, Hugging Face models, Anthropic's Claude, and local models run with Ollama.

LlamaIndex is a project that provides a central interface to connect your language models with external data. It was created by Jerry Liu and his team in the fall of 2022. It consists of a set of data structures designed to make it easier to use large external knowledge bases with language models. Some of its uses are:

  • Querying structured data such as tables or databases using natural language (see the sketch after this list)
  • Retrieving relevant facts or information from large text corpora
  • Enhancing language models with domain-specific knowledge
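
As a concrete illustration of the first item above, the following is a minimal sketch of querying a SQL table in natural language. It assumes the llama-index and sqlalchemy packages are installed and that the OPENAI_API_KEY environment variable is set; the books table and its contents are hypothetical.

# A minimal sketch of natural language queries against a SQL table.
# Assumes llama-index and sqlalchemy are installed and OPENAI_API_KEY
# is set; the books table and its contents are hypothetical.
from sqlalchemy import create_engine, text

from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

# Create a small in-memory SQLite table to query.
engine = create_engine("sqlite:///:memory:")
with engine.begin() as connection:
    connection.execute(text("CREATE TABLE books (title TEXT, year INTEGER)"))
    connection.execute(text(
        "INSERT INTO books VALUES ('Practical AI', 2023), ('Common Lisp', 2021)"))

# Wrap the SQLAlchemy engine and let the LLM translate questions to SQL.
sql_database = SQLDatabase(engine, include_tables=["books"])
query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["books"])

print(query_engine.query("Which book was published most recently?"))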

LlamaIndex supports a variety of document types (a loading sketch follows this list), including:

  • Text documents, the most common type, which can be stored in a variety of formats such as .txt, .doc, and .pdf.
  • XML documents, which store data in a structured, tagged format.
  • JSON documents, which store data in a lightweight interchange format.
  • HTML documents, the format used to create web pages.
  • PDF documents, which store content in a fixed layout.
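
All of these formats can be loaded through a single reader class. The following is a minimal sketch, assuming the llama-index package is installed and that ./data is a directory containing files in the formats above (the directory name is hypothetical):

# A minimal sketch of loading mixed document types from one directory.
# SimpleDirectoryReader picks an appropriate parser for each file type;
# .pdf files may require an extra parser package such as pypdf.
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")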

LlamaIndex can also index data that is stored in a variety of databases, including:

  • SQL databases such as MySQL, PostgreSQL, and Oracle; PostgreSQL in particular is widely used in enterprise applications.
  • NoSQL databases such as MongoDB, Cassandra, and CouchDB; MongoDB is easy to use and scale, while Apache Cassandra is designed for storing very large amounts of data.
  • Solr, a popular open-source search engine that provides high performance and scalability.
  • Elasticsearch, another popular open-source search engine that offers full-text search, geospatial search, and machine learning features.

LlamaIndex is a flexible framework that can be used to index a variety of document types and data sources.
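
As one example of reading from a database, rows returned by a SQL query can be converted to indexable documents with the DatabaseReader integration. This is a minimal sketch, assuming the separately installed llama-index-readers-database package; the connection URI and query are hypothetical:

# A minimal sketch of indexing rows from a SQL database.
# Assumes: pip install llama-index llama-index-readers-database
# The connection URI and SQL query below are hypothetical.
from llama_index.core import VectorStoreIndex
from llama_index.readers.database import DatabaseReader

reader = DatabaseReader(uri="postgresql://user:password@localhost:5432/mydb")
documents = reader.load_data(query="SELECT title, summary FROM articles")
index = VectorStoreIndex.from_documents(documents)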

Compared to LangChain, LlamaIndex offers a focused advantage for indexing and retrieval tasks, making it an efficient choice for applications that prioritize these functions. Its design is tailored to the ingestion, structuring, and accessing of private or domain-specific data, which is crucial for applications that depend on quick retrieval of accurate, relevant information from large datasets. Its streamlined interface simplifies connecting custom data sources to large language models (LLMs), reducing the complexity and development time of search-centric applications. This focus on indexing and retrieval yields increased speed and accuracy in search and summarization tasks, setting LlamaIndex apart as a go-to framework for developers building intelligent search tools.

Another significant advantage of LlamaIndex is its integration with a wide array of tools and services, which enhances the functionality and versatility of LLM-powered applications. The framework integrates with vector stores like Pinecone and Milvus for efficient document search and retrieval. Its compatibility with tracing tools such as Graphsignal offers insight into how an LLM-powered application operates, while integration with application frameworks like LangChain and Streamlit makes building and deployment easier. These integrations extend to data loaders, agent tools, and observability tools, enhancing the capabilities of data agents and offering structured output formats that make application results easier to consume. This extensive integration ecosystem lets developers create powerful, versatile applications with minimal effort.
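
For example, here is a minimal sketch of backing an index with Pinecone. It assumes the llama-index-vector-stores-pinecone and pinecone packages, a Pinecone API key, and an existing Pinecone index named "my-index" (the key and index name are hypothetical):

# A minimal sketch of using Pinecone as the vector store behind an index.
# Assumes: pip install llama-index llama-index-vector-stores-pinecone pinecone
# The API key and the index name "my-index" are hypothetical.
import os

from pinecone import Pinecone
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.pinecone import PineconeVectorStore

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
vector_store = PineconeVectorStore(pinecone_index=pc.Index("my-index"))

# Route index storage through Pinecone instead of in-memory storage.
storage_context = StorageContext.from_defaults(vector_store=vector_store)
documents = [Document(text="LlamaIndex connects LLMs to external data.")]
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)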

Lastly, LlamaIndex’s specialized focus on indexing and retrieval is complemented by its simplicity and ease of use, making it an attractive option for developers seeking to build efficient and straightforward search experiences. The framework’s optimization for these specific tasks, in comparison to more general-purpose frameworks like LangChain, results in a tool that is not only more efficient for search and retrieval applications but also easier to learn and implement. This simplicity is particularly beneficial for projects with tight deadlines or for developers new to working with LLMs, as it allows for the quick deployment of high-performance applications without the need for extensive customization or complex setup processes.

We will look at a short example derived from the LlamaIndex documentation.

Using LlamaIndex for Question Answering from a Web Site

In this example we use the trafilatura and html2text libraries to extract text from a web page that we will index and search. The trafilatura library fetches a web page and extracts its plain text, which we wrap in a LlamaIndex Document; the VectorStoreIndex class then builds a local index used with OpenAI API calls to implement search. (LlamaIndex also provides a TrafilaturaWebReader class that wraps this fetch-and-extract pattern for a list of web page URIs.)

pip install -U llama-index trafilatura html2text

The following listing shows the file web_page_QA.py:

# Derived from examples in the llama_index documentation

# pip install llama-index html2text trafilatura
# Requires the OPENAI_API_KEY environment variable to be set.

import trafilatura

from llama_index.core import Document, VectorStoreIndex


def query_website(url, *questions):
    # Fetch the web page and extract its plain text.
    downloaded = trafilatura.fetch_url(url)
    text = trafilatura.extract(downloaded)
    # Wrap the text in a Document and build an in-memory vector index.
    list_of_documents = [Document(text=text)]
    index = VectorStoreIndex.from_documents(list_of_documents)
    engine = index.as_query_engine()
    for question in questions:
        print(f"\n== QUESTION: {question}\n")
        response = engine.query(question)
        print(f"== RESPONSE: {response}")


if __name__ == "__main__":
    url = "https://markwatson.com"
    query_website(url, "What instruments does Mark play?",
                  "How many books has Mark written?")

This example is not efficient because we create a new index for each web page we want to search. That said, it implements a pattern you can use, for example, to build a reusable index of your company's web site and then build an end-user web search app on top of it.
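
To make the index reusable instead of rebuilding it on every run, you can persist it to disk and reload it later. Here is a minimal sketch using the core LlamaIndex storage APIs (the ./storage directory name is an arbitrary choice):

# A minimal sketch of persisting an index and reloading it later,
# so the index is built once and reused across runs.
from llama_index.core import (Document, StorageContext, VectorStoreIndex,
                              load_index_from_storage)

# Build and save the index (done once).
index = VectorStoreIndex.from_documents([Document(text="example text")])
index.storage_context.persist(persist_dir="./storage")

# Later, reload the saved index instead of rebuilding it.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
engine = index.as_query_engine()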

The output for the two test questions in web_page_QA.py is:

$ python web_page_QA.py

== QUESTION: What instruments does Mark play?

== RESPONSE: Mark plays the guitar, didgeridoo, and American Indian flute.

== QUESTION: How many books has Mark written?

== RESPONSE: Mark has written 9 books.

Note that the answer to the second question is incorrect, although the model did its job: it correctly counted the books mentioned in the extracted text. However, the Trafilatura library skipped the text in the header block of my web site stating that I have written over 20 books. The inaccuracy comes from my use of the Trafilatura library, not from LlamaIndex.

LlamaIndex Case Study Wrap Up

LlamaIndex is a set of data structures and library code designed to make it easier to use large external knowledge bases such as Wikipedia. LlamaIndex creates a vectorized index from your document data, making it highly efficient to query. It then uses this index to identify the most relevant sections of the document based on the query.
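
The retrieval step can also be used on its own, without generating an answer. Here is a minimal sketch using the retriever interface; the similarity_top_k value of 3 is an arbitrary choice:

# A minimal sketch of using the index's retriever directly to fetch
# the most relevant text chunks for a query, without answer synthesis.
from llama_index.core import Document, VectorStoreIndex

index = VectorStoreIndex.from_documents([Document(text="example text")])
retriever = index.as_retriever(similarity_top_k=3)  # return the top 3 chunks

for result in retriever.retrieve("What is this document about?"):
    print(result.score, result.node.get_content()[:80])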

LlamaIndex is useful because it provides a central interface to connect your LLMs with external data and offers data connectors for your existing data sources and formats (APIs, PDFs, documents, SQL, etc.). It provides a simple, flexible interface between your external data and LLMs.

Some example applications of LlamaIndex include building personal assistants with LlamaIndex and GPT-4, document retrieval, and combining answers across multiple documents.