I. Background
Large language models are all the rage; almost everyone is using them, and almost every project touches them. If you develop LLM-related projects, LangChain is hard to avoid. You may be using an LLM development SDK handed to you by some organization and never feel LangChain's presence, but peel back that SDK and LangChain is still underneath, so you might as well learn it. After deciding to learn LangChain, I glanced at the documentation and gave up; my eyes could not take it. I searched GitHub instead and found langchain-examples. Let's start from the code and consult the documentation only when needed.
Repository: https://github.com/alphasecio/langchain-examples.git
II. Environment Setup
Clone the repository above (https://github.com/alphasecio/langchain-examples.git) in PyCharm to create the project. Choose Python 3.10 as the project interpreter. This version matters; it saves you detours when installing dependencies. If you use 3.13, you will not climb out of the hole even with AI's help.
Once the Python version is set, open the project and, wherever an error is reported, open the details and install the missing dependency. The benefit is that you do not have to install everything up front, only what you actually use. You can also scan all the code and install whatever dependencies are missing, or simply install from the project's requirements.txt. For the details, ask DeepSeek or ChatGPT; our focus is the code.
III. Code Walkthrough
The first sub-project I debugged is all-in-one.
Open the Terminal window and run: streamlit run all-in-one/Home.py. The browser opens a page at http://localhost:8501/ that looks like this:
The first time I saw this page I had a question: I only ran Home.py, so where does the navigation sidebar come from? I checked the code; there is no navigation code in it, only the content of the right-hand side. So where does the left-right layout come from? It is a Streamlit feature: note that the project has a pages directory, and every page under that directory becomes an item in the sidebar. Home.py is special and sits at the same level as the pages directory.
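For reference, the layout that makes this work looks roughly like the following (the file names under pages are approximate; check the repository for the exact ones):

all-in-one/
├── Home.py          # the entry point run with streamlit
└── pages/           # every script here becomes a sidebar item
    ├── ...          # Search, URL/Text/Document/News Summary pages
    └── settings.py  # the Settings page modified below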
1. Modifying the Settings page
Find settings.py under pages. The code does not persist user input, so every debugging run means typing the settings all over again, which is tedious. We can add a config.json file and change the code so the Settings page first loads config.json and displays it (empty the first time); if the user edits anything, the result is saved back. That is much more convenient. Example code:
import streamlit as st
import json
import os

# Path to the configuration file
CONFIG_PATH = "../config.json"

# Default values
default_config = {
    "openai_api_key": "fill in your own",
    "serper_api_key": "fill in your own",
    "model": "gpt-3.5-turbo",
    "base_url": "fill in your own"
}

# Load the configuration file
def load_config():
    if os.path.exists(CONFIG_PATH):
        with open(CONFIG_PATH, "r") as f:
            return json.load(f)
    return default_config.copy()

# Save the configuration file
def save_config(config):
    with open(CONFIG_PATH, "w") as f:
        json.dump(config, f, indent=2)

# Initialize session_state
if "config" not in st.session_state:
    st.session_state.config = load_config()

st.title("🔧 Settings")

# Display and edit the configuration
st.text_input("OpenAI API Key", key="openai_api_key", value=st.session_state.config.get("openai_api_key", ""), type="password")
st.text_input("Serper API Key", key="serper_api_key", value=st.session_state.config.get("serper_api_key", ""), type="password")
st.selectbox("Model", ["gpt-3.5-turbo", "gpt-4"], key="model", index=["gpt-3.5-turbo", "gpt-4"].index(st.session_state.config.get("model", "gpt-3.5-turbo")))
st.text_input("Base URL", key="base_url", value=st.session_state.config.get("base_url", ""))

# Save button
if st.button("Save"):
    new_config = {
        "openai_api_key": st.session_state.openai_api_key,
        "serper_api_key": st.session_state.serper_api_key,
        "model": st.session_state.model,
        "base_url": st.session_state.base_url,
    }
    save_config(new_config)
    st.session_state.config = new_config
    st.success("✅ Configuration saved!")
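After clicking Save, config.json ends up looking roughly like this (all values below are placeholders, not real keys or endpoints):

{
  "openai_api_key": "sk-xxxx-placeholder",
  "serper_api_key": "xxxx-placeholder",
  "model": "gpt-3.5-turbo",
  "base_url": "https://api.example.com/v1"
}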
2. The Search page
It searches Google first, then has the LLM summarize the results into the final output. The core code is just three lines:
tools = load_tools(["google-serper"], llm, serper_api_key=st.session_state.config.get("serper_api_key", ""))
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
result = agent.run(search_query)
* Load the google-serper tool, which performs Google searches through the Serper API
* Initialize the agent
* Run the agent on the query
Through the agent, this code searches Google first, then summarizes and consolidates the search results with the LLM to produce a conclusion.
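The llm handle passed into load_tools and initialize_agent above can be built from the same configuration used by the other pages; a minimal sketch following this article's config-driven pattern (not the repository's original code):

from langchain_openai import ChatOpenAI
import streamlit as st

# Build the chat model from the values saved on the Settings page
llm = ChatOpenAI(
    temperature=0,
    openai_api_key=st.session_state.config.get("openai_api_key", ""),
    model_name=st.session_state.config.get("model", "gpt-3.5-turbo"),
    base_url=st.session_state.config.get("base_url", ""),
)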
3. URL Summary
url_summary.py summarizes the content of the page at a user-provided URL. When debugging this code, watch the output carefully: it is very likely hallucinating. I had to modify the code before it ran properly and met the goal:
import validators, streamlit as st
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import YoutubeLoader, UnstructuredURLLoader

# Initialize config (only on first load)
if "config" not in st.session_state:
    from pathlib import Path
    import json
    if Path("../config.json").exists():
        with open("../config.json", "r") as f:
            st.session_state.config = json.load(f)
    else:
        st.session_state.config = {
            "openai_api_key": "fill in your own",
            "serper_api_key": "fill in your own",
            "model": "gpt-3.5-turbo",
            "base_url": "fill in your own"
        }

# Streamlit app
st.subheader('URL Summary')
url = st.text_input("Enter Source URL")

# If 'Summarize' button is clicked
if st.button("Summarize"):
    # Validate inputs
    if not st.session_state.config.get("openai_api_key", ""):
        st.error("Please provide the missing API keys in Settings.")
    elif not url:
        st.error("Please provide the URL.")
    elif not validators.url(url):
        st.error("Please enter a valid URL.")
    else:
        try:
            with st.spinner("Please wait..."):
                # Load the page content with WebBaseLoader (the change that fixed the hallucinated summaries)
                from langchain_community.document_loaders import WebBaseLoader
                loader = WebBaseLoader(url)
                data = loader.load()

                # # Load URL data (original version)
                # if "youtube.com" in url:
                #     loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
                # else:
                #     loader = UnstructuredURLLoader(urls=[url], ssl_verify=False, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_5_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"})
                # data = loader.load()

                # Initialize the ChatOpenAI module, load and run the summarize chain
                # llm = ChatOpenAI(temperature=0, model='gpt-3.5-turbo', openai_api_key=openai_api_key)
                llm = ChatOpenAI(temperature=0,
                                 openai_api_key=st.session_state.config.get("openai_api_key", ""),
                                 model_name=st.session_state.config.get("model", ""),  # or a model variable you pass in
                                 base_url=st.session_state.config.get("base_url", ""),
                                 verbose=True)
                prompt_template = """Write a summary of the following in 250-300 words.
                {text}
                """
                prompt = PromptTemplate(template=prompt_template, input_variables=["text"])
                chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt)
                summary = chain.run(data)
                st.success(summary)
        except Exception as e:
            st.exception(f"Exception: {e}")
How did I know it was hallucinating? What it said had nothing to do with the content of the URL, and after I changed the URL the output stayed the same. The code looks long, but the core is still just a few statements: load the URL content, build the summarize chain, and run it.
Is a feature like URL summary meant to be paired with a crawler? If you have concrete scenarios for it, please share them so we can look at them together.
4. Text Summary
The functionality is simple; the part worth noting is that the text is split into chunks and only the first three chunks are summarized:
import streamlit as st
from langchain.chains.summarize import load_summarize_chain
from langchain_openai import ChatOpenAI
from langchain_community.docstore.document import Document
from langchain_text_splitters import CharacterTextSplitter

# Set API keys from session state
# openai_api_key = st.session_state.openai_api_key

# Streamlit app
st.subheader('Text Summary')
source_text = st.text_area("Enter Source Text", height=200)

# If the 'Summarize' button is clicked
if st.button("Summarize"):
    # Validate inputs
    if not st.session_state.config.get("openai_api_key", ""):
        st.error("Please provide the missing API keys in Settings.")
    elif not source_text.strip():
        st.error("Please provide the source text.")
    else:
        try:
            with st.spinner('Please wait...'):
                # Split the source text
                text_splitter = CharacterTextSplitter()
                texts = text_splitter.split_text(source_text)
                # Create Document objects for the texts (max 3 pages)
                docs = [Document(page_content=t) for t in texts[:3]]
                # Initialize the OpenAI module, load and run the summarize chain
                llm = ChatOpenAI(temperature=0,
                                 openai_api_key=st.session_state.config.get("openai_api_key", ""),
                                 model_name=st.session_state.config.get("model", ""),  # or a model variable you pass in
                                 base_url=st.session_state.config.get("base_url", ""),
                                 verbose=True)
                chain = load_summarize_chain(llm, chain_type="map_reduce")
                summary = chain.run(docs)
                st.success(summary)
        except Exception as e:
            st.exception(f"An error occurred: {e}")
This code needed few changes; I only adjusted the LLM initialization to deal with the API key issue.
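A side note on the texts[:3] cut above: CharacterTextSplitter splits on blank lines with a default chunk size of roughly 4,000 characters, so only the first three chunks ever reach the model. A tiny standalone sketch with made-up input illustrates the behavior:

from langchain_text_splitters import CharacterTextSplitter
from langchain_community.docstore.document import Document

sample = "A paragraph of source text.\n\n" * 500        # hypothetical long input
splitter = CharacterTextSplitter()                      # defaults: split on "\n\n", ~4,000-char chunks
chunks = splitter.split_text(sample)
docs = [Document(page_content=c) for c in chunks[:3]]   # same truncation as the page above
print(f"{len(chunks)} chunks produced, {len(docs)} kept for summarization")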
5. Document Summary
This feature expects the uploaded PDF to be under 2 MB. Another point to note concerns the model parameters: one model is used for embeddings and another for chat, so the code has to change accordingly:
import os, tempfile, streamlit as st
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import PyPDFLoader

# Set API keys from session state
# openai_api_key = st.session_state.openai_api_key

# Streamlit app
st.subheader('Document Summary')
source_doc = st.file_uploader("Upload Source Document", type="pdf")

# If the 'Summarize' button is clicked
if st.button("Summarize"):
    # Validate inputs
    # if not openai_api_key:
    if not st.session_state.config.get("openai_api_key", ""):
        st.error("Please provide the missing API keys in Settings.")
    elif not source_doc:
        st.error("Please provide the source document.")
    else:
        try:
            with st.spinner('Please wait...'):
                # Save uploaded file temporarily to disk, load and split the file into pages, delete temp file
                with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
                    tmp_file.write(source_doc.read())
                loader = PyPDFLoader(tmp_file.name)
                pages = loader.load_and_split()
                os.remove(tmp_file.name)

                # Create embeddings for the pages and insert into Chroma database
                # (a dedicated embedding model, separate from the chat model)
                # embeddings = OpenAIEmbeddings(openai_api_key=st.session_state.config.get("openai_api_key", ""))
                embeddings = OpenAIEmbeddings(
                    openai_api_key=st.session_state.config.get("openai_api_key", ""),
                    base_url=st.session_state.config.get("base_url", ""),
                    model="text-embedding-3-small"
                )
                vectordb = Chroma.from_documents(pages, embeddings)

                # Initialize the OpenAI module, load and run the summarize chain
                # llm = ChatOpenAI(temperature=0, openai_api_key=openai_api_key)
                llm = ChatOpenAI(temperature=0,
                                 openai_api_key=st.session_state.config.get("openai_api_key", ""),
                                 model_name=st.session_state.config.get("model", ""),
                                 base_url=st.session_state.config.get("base_url", ""),
                                 verbose=True)
                chain = load_summarize_chain(llm, chain_type="stuff")
                # A blank query returns only the top few chunks from Chroma,
                # which keeps the "stuff" prompt small enough for the model
                search = vectordb.similarity_search(" ")
                summary = chain.run(input_documents=search, question="Write a summary within 200 words.")
                st.success(summary)
        except Exception as e:
            st.exception(f"An error occurred: {e}")
The core of this code is the embedding setup, the Chroma store, and the summarize chain.
6. News Summary
It calls the Google search API (via Serper), fetching a handful of results (three by default), and then summarizes the content behind each result URL:
import streamlit as st
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_community.utilities import GoogleSerperAPIWrapper
from langchain.text_splitter import CharacterTextSplitter
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize config (if not configured yet)
if "config" not in st.session_state:
    from pathlib import Path
    import json
    if Path("../config.json").exists():
        with open("../config.json", "r") as f:
            st.session_state.config = json.load(f)
    else:
        st.session_state.config = {
            "openai_api_key": "",
            "serper_api_key": "",
            "model": "gpt-3.5-turbo",
            "base_url": "https://api.chatanywhere.tech/v1"
        }

# Pull values out of the config
openai_api_key = st.session_state.config.get("openai_api_key", "")
serper_api_key = st.session_state.config.get("serper_api_key", "")
model_name = st.session_state.config.get("model", "")
base_url = st.session_state.config.get("base_url", "")

# Streamlit UI
st.subheader('News Summary')
col1, col2 = st.columns([3, 1])
search_query = col1.text_input("Search Query")
num_results = col2.number_input("Number of Results", min_value=3, max_value=10)
col3, col4 = st.columns([1, 3])

# Search only
if col3.button("Search"):
    if not openai_api_key or not serper_api_key:
        st.error("Please provide the missing API keys in Settings.")
    elif not search_query.strip():
        st.error("Please provide the search query.")
    else:
        try:
            with st.spinner("Please wait..."):
                search = GoogleSerperAPIWrapper(type="news", tbs="qdr:w1", serper_api_key=serper_api_key)
                result_dict = search.results(search_query)
                if not result_dict.get("news"):
                    st.error(f"No results found for: {search_query}")
                else:
                    for i, item in zip(range(num_results), result_dict["news"]):
                        st.success(f"Title: {item['title']}\n\nLink: {item['link']}\n\nSnippet: {item['snippet']}")
        except Exception as e:
            st.exception(f"Exception: {e}")

# Search & Summarize
if col4.button("Search & Summarize"):
    if not openai_api_key or not serper_api_key:
        st.error("Please provide the missing API keys in Settings.")
    elif not search_query.strip():
        st.error("Please provide the search query.")
    else:
        try:
            with st.spinner("Please wait..."):
                # Search for news
                search = GoogleSerperAPIWrapper(
                    type="news",
                    tbs="qdr:w1",
                    serper_api_key=serper_api_key
                )
                result_dict = search.results(search_query)
                # Try to extract results from several possible fields
                news_results = (
                    result_dict.get("news")
                    or result_dict.get("organic")
                    or result_dict.get("results")
                    or []
                )
                if not news_results:
                    st.error(f"No search results for: {search_query}")
                else:
                    for i, item in zip(range(num_results), news_results):
                        # Load the web page content
                        loader = UnstructuredURLLoader(
                            urls=[item["link"]],
                            ssl_verify=False,
                            headers={
                                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_5_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
                            }
                        )
                        data = loader.load()
                        # Split into chunks so the text does not get too long
                        text_splitter = RecursiveCharacterTextSplitter(
                            chunk_size=1000,
                            chunk_overlap=100
                        )
                        docs = text_splitter.split_documents(data)
                        # Initialize the LLM (supports a custom base_url and model_name)
                        llm = ChatOpenAI(
                            temperature=0,
                            openai_api_key=openai_api_key,
                            base_url=st.session_state.config.get("base_url", ""),
                            model_name=st.session_state.config.get("model", ""),
                            verbose=True
                        )
                        # Use map-reduce for better summaries of long text
                        prompt_template = """Write a concise news summary of the following content in 100-150 words:\n\n{text}"""
                        prompt = PromptTemplate(
                            template=prompt_template,
                            input_variables=["text"]
                        )
                        chain = load_summarize_chain(
                            llm,
                            chain_type="map_reduce",
                            map_prompt=prompt,
                            combine_prompt=prompt
                        )
                        summary = chain.run(docs)
                        # Show the summary
                        st.success(f"**Title**: {item['title']}\n\n**Link**: {item['link']}\n\n**Summary**: {summary}")
        except Exception as e:
            st.exception(f"Exception: {e}")
Note that UnstructuredURLLoader is used here. In URL Summary, my tests showed that summarizing with this loader went wrong, yet in this feature it works fine. I hope interested readers will keep digging into this; let's find time to discuss it.
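If you want to follow up, here is a minimal sketch for comparing what the two loaders actually extract from the same page (the URL below is only a placeholder):

from langchain_community.document_loaders import WebBaseLoader, UnstructuredURLLoader

url = "https://example.com/some-article"  # placeholder URL

# What WebBaseLoader extracts
web_docs = WebBaseLoader(url).load()
print("WebBaseLoader:", web_docs[0].page_content[:300] if web_docs else "(nothing)")

# What UnstructuredURLLoader extracts
unstructured_docs = UnstructuredURLLoader(urls=[url], ssl_verify=False).load()
print("UnstructuredURLLoader:", unstructured_docs[0].page_content[:300] if unstructured_docs else "(nothing)")

If one loader returns empty or boilerplate text for a given site, the summarize chain has nothing real to work with, which would explain summaries that look like hallucination.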
IV. Review and Takeaways
I have reviewed all of the code once through, and everything runs. As examples, these pages show how LangChain is applied in specific scenarios. My one complaint is that the code does not spell out, through comments, the template (the recurring pattern) for solving problems with LangChain. Read the official English documentation to go further, or discuss it with me.