LLM Input/Output - Advanced Operations
Table of contents
The following operations are covered in this notebook:
- Cost Monitoring
- Caching
- Streaming
Chat Models and LLMs
Accessing Commercial LLMs like ChatGPT
from langchain_openai import ChatOpenAI
# instantiate the model
llm = ChatOpenAI(
    model='gpt-3.5-turbo',
    temperature=0
)
Tracking LLM Costs
Commercial LLM APIs such as OpenAI's typically charge based on the number of tokens used per request and response. Tokens are essentially chunks of text, and the cost of an interaction is calculated from how many tokens are processed. Tracking token usage for specific API calls lets you manage and optimize your costs effectively.
In LangChain, this tracking feature is currently implemented only for the OpenAI API; it lets you monitor and control token consumption precisely.
from langchain_community.callbacks import get_openai_callback
prompt = """Explain Deep Learning in one sentence"""
with get_openai_callback() as callback:
    response = llm.invoke(prompt)
    print(response.content)
    print(callback)
Deep learning is a subset of machine learning that uses artificial neural networks to model and solve complex problems by learning from large amounts of data.
Tokens Used: 41
Prompt Tokens: 14
Completion Tokens: 27
Successful Requests: 1
Total Cost (USD): $7.500000000000001e-05
print(callback.total_tokens)
41
print(callback.prompt_tokens, callback.completion_tokens)
14 27
print(callback.total_cost)
7.500000000000001e-05
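The callback accumulates usage across every call made inside the with block, so the same pattern can be used to measure the combined cost of several requests. A minimal sketch, reusing the llm object created above (the prompts are only illustrative):

with get_openai_callback() as callback:
    llm.invoke("Explain overfitting in one sentence")
    llm.invoke("Explain underfitting in one sentence")

# the totals are summed over both requests
print(callback.successful_requests)
print(callback.total_tokens, callback.total_cost)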
Caching in LangChain
LangChain includes an optional caching layer for LLM calls, offering significant benefits in terms of cost efficiency and performance.
- Cost Efficiency:
- The caching feature helps reduce the number of API calls made to LLM providers. By storing responses, you can avoid repeatedly requesting the same completions.
- This is particularly advantageous for applications that frequently make identical requests, as it can substantially lower operational costs.
- Performance Improvement:
- Caching can greatly enhance your application’s speed by minimizing the need for repeated API calls to the LLM provider.
- With cached responses readily available, interactions become faster and more efficient, leading to a smoother user experience and quicker processing times.
Overall, LangChain’s caching layer is a valuable feature for optimizing both the cost and performance of applications using language model APIs.
InMemoryCache
%%time
from langchain.cache import InMemoryCache
from langchain.globals import set_llm_cache
set_llm_cache(InMemoryCache())
from langchain_core.prompts import ChatPromptTemplate
# first call to the model: nothing is cached yet, so it takes longer to execute
prompt = """Explain what is Digital Image Processing"""
template = ChatPromptTemplate.from_template(template=prompt)
llm.invoke(template.format())
CPU times: user 580 ms, sys: 36.1 ms, total: 616 ms
Wall time: 2.46 s
AIMessage(content='Digital image processing is the manipulation and analysis of digital images using various algorithms and techniques to enhance, compress, or extract information from the images. It involves processing images captured by digital cameras or generated by computer graphics to improve their quality, extract useful information, or perform specific tasks such as image recognition, pattern recognition, and image restoration. Digital image processing is widely used in various fields such as medical imaging, remote sensing, surveillance, and multimedia applications.', response_metadata={'token_usage': {'completion_tokens': 89, 'prompt_tokens': 15, 'total_tokens': 104}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None})
%%time
# second identical call: the response is served from the in-memory cache, so it executes much faster
llm.invoke(template.format())
CPU times: user 2.87 ms, sys: 0 ns, total: 2.87 ms
Wall time: 2.72 ms
AIMessage(content='Digital image processing is the manipulation and analysis of digital images using various algorithms and techniques to enhance, compress, or extract information from the images. It involves processing images captured by digital cameras or generated by computer graphics to improve their quality, extract useful information, or perform specific tasks such as image recognition, pattern recognition, and image restoration. Digital image processing is widely used in various fields such as medical imaging, remote sensing, surveillance, and multimedia applications.', response_metadata={'token_usage': {'completion_tokens': 89, 'prompt_tokens': 15, 'total_tokens': 104}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None})
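Keep in mind that a cache hit requires the new request to match a previous one exactly: the lookup is keyed on the prompt text together with the model configuration, so even a small change in wording or in parameters such as temperature results in a fresh API call. A brief sketch, assuming the in-memory cache configured above:

# identical prompt -> served from the cache, near-instant
llm.invoke(template.format())

# slightly different wording -> cache miss, a new API call is made
llm.invoke("Explain what Digital Image Processing is used for")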
SQLite Cache
from langchain.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path="langchain.db"))
%%time
prompt = """one use cse of topological sorting"""
# first call to the model: not cached yet, so it takes longer to execute
template = ChatPromptTemplate.from_template(prompt)
llm.invoke(template.format())
CPU times: user 50.9 ms, sys: 2.76 ms, total: 53.6 ms
Wall time: 1.37 s
AIMessage(content='One use case of topological sorting is in scheduling tasks or activities that have dependencies on each other. For example, in project management, tasks need to be completed in a specific order to ensure that all dependencies are met. Topological sorting can help determine the order in which tasks should be executed to meet these dependencies and complete the project efficiently.', response_metadata={'token_usage': {'completion_tokens': 68, 'prompt_tokens': 17, 'total_tokens': 85}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None})
%%time
# second identical call: the response is read from the SQLite cache, so it executes faster
llm.invoke(template.format())
CPU times: user 109 ms, sys: 58.6 ms, total: 168 ms
Wall time: 167 ms
AIMessage(content='One use case of topological sorting is in scheduling tasks or activities that have dependencies on each other. For example, in project management, tasks need to be completed in a specific order to ensure that all dependencies are met. Topological sorting can help determine the order in which tasks should be executed to meet these dependencies and complete the project efficiently.', response_metadata={'token_usage': {'completion_tokens': 68, 'prompt_tokens': 17, 'total_tokens': 85}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-a4199711-8363-4a3a-89c6-8144ac8d1178-0')
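Unlike the in-memory cache, the SQLite cache persists on disk in langchain.db, so cached responses survive notebook restarts. If you want to start fresh, you can clear the active cache or turn caching off; a short sketch, assuming the cache configured above:

from langchain.globals import get_llm_cache, set_llm_cache

# remove all entries from the currently configured cache
get_llm_cache().clear()

# or disable caching entirely
set_llm_cache(None)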
Streaming in LLMs
All chat model and LLM interfaces in LangChain implement the Runnable interface, which provides a set of default methods including invoke, ainvoke, batch, abatch, stream, and astream. This ensures that every model comes with basic streaming capabilities and supports both synchronous and asynchronous operation.
Streaming Defaults
- Synchronous Streaming:
  - By default, the synchronous stream method returns an Iterator that yields a single value, representing the final result from the LLM provider.
  - This approach allows for efficient processing of responses in a sequential manner.
- Asynchronous Streaming:
  - Similarly, the asynchronous astream method defaults to returning an AsyncIterator that also yields the final result (a short astream sketch follows below).
  - This method supports non-blocking operations, enabling the handling of multiple tasks concurrently and improving overall performance in applications requiring real-time interactions.
This standardized interface and default behavior streamline the implementation of various language models within LangChain, ensuring consistency and ease of use across different streaming scenarios.
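The asynchronous counterpart of the loop shown next is astream, which yields response chunks without blocking the event loop. A minimal sketch (the prompt is only illustrative; inside a notebook you can simply await the coroutine instead of calling asyncio.run):

import asyncio

async def stream_async():
    # chunks arrive as they are generated and are printed immediately
    async for chunk in llm.astream("one advantage of asynchronous I/O in one line"):
        print(chunk.content, end="")

asyncio.run(stream_async())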
prompt = """one advantage of distributed computing in one line"""
template = ChatPromptTemplate.from_template(prompt)
# streaming response
for chunk in llm.stream(template.format()):
    print(chunk.content)
Increased
fault
tolerance
and
reliability
due
to
multiple
nodes
working
together
.
prompt = """Explain distributed computing in detail"""
template = ChatPromptTemplate.from_template(prompt)
response = []
for chunk in llm.stream(template.format()):
    print(chunk.content, end="")
    response.append(chunk.content)
Distributed computing is a computing paradigm in which multiple computers work together on a task, sharing resources and processing power to achieve a common goal. In a distributed computing system, tasks are divided into smaller sub-tasks and distributed among multiple computers, often referred to as nodes or servers. These nodes communicate with each other over a network to coordinate their efforts and share information.
One of the key advantages of distributed computing is its ability to harness the collective power of multiple computers to solve complex problems more quickly and efficiently than a single computer could on its own. By distributing the workload among multiple nodes, distributed computing systems can handle larger volumes of data and perform computations at a faster rate.
There are several different models of distributed computing, including client-server architecture, peer-to-peer networks, and grid computing. In a client-server architecture, one or more central servers coordinate the activities of multiple client computers, which request and receive services from the server. In a peer-to-peer network, all nodes have equal status and can communicate directly with each other to share resources and information. Grid computing involves connecting multiple computers across different locations to form a virtual supercomputer, enabling large-scale computations and data processing.
Distributed computing is used in a wide range of applications, including scientific research, data analysis, financial modeling, and cloud computing. It offers scalability, fault tolerance, and high performance, making it a valuable tool for organizations looking to leverage the power of multiple computers to solve complex problems.
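Because every streamed chunk was also appended to the response list, the complete answer can be reassembled after streaming finishes:

# join the streamed chunks back into the full response text
full_response = "".join(response)
print(len(response), "chunks,", len(full_response), "characters")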