大语言模型与数据库结合:在Postgres中使用Ollama进行数据分析

爱的威利斯 2024-09-29 19:28:50

开源大语言模型(LLMs)是开发人员构建AI应用程序的首选,尤其是在检索增强生成(RAG)、搜索和AI代理等领域。与专有模型相比,开源模型具有私密性、免费使用(不包括硬件成本)、可本地运行和易于自定义的优点。然而,开始使用开源LLMs可能会比较复杂。

这时,Ollama的出现解决了这一问题。Ollama被称为“LLMs的Docker”,使开发人员能够轻松访问和使用多种开源模型,如Llama 3、Mistral、Phi 3和Gemma等。此外,Ollama允许在现有机器上本地部署开源LLMs,使开发人员能够快速入门并构建完整的本地AI应用程序。

大语言模型已经阅读了这个星球上所有的数据,但是有一个领域的数据是大模型没有涉及的,那就是我们常规信息系统里的数据库数据。而将大模型应用到企业中的最佳的应用范式是,大语言模型和传统的数据库数据进行结合。

最近一件让人激动人心的消息是:PostgreSQL开发人员现在可以通过pgai扩展在数据库内部访问Ollama模型。Pgai是一个开源PostgreSQL扩展,将AI模型更近地整合到数据中,简化了使用PostgreSQL构建搜索和RAG AI应用程序的过程。

Pgai的功能

pgai将嵌入和生成AI模型带入数据库,开发人员可以直接在PostgreSQL中执行以下操作:

创建数据的向量嵌入。从Ollama、OpenAI等模型中检索LLM聊天补全。在PostgreSQL中的现有关系数据上进行推理,支持分类、摘要和数据丰富等用例。如何开始使用pgai

每个人都可以在PostgreSQL数据库中使用pgai,只需要安装pgai扩展即可。目前Pgai扩展支撑一下的AI提供者:

Ollama - 配置pgai以使用Ollama,使用该模型进行嵌入、聊天补全和生成。OpenAI - 配置pgai以使用OpenAI,进行分词、嵌入、聊天补全和内容审核。Anthropic - 配置pgai以使用Anthropic,生成内容。Cohere - 配置pgai以使用Cohere,进行分词、嵌入、聊天补全、分类和重新排名。使用Ollama和pgai的优势数据隐私:通过在本地运行LLMs,避免将数据发送到第三方,从而更好地控制数据。低成本:除了硬件成本外,不需按令牌付费,适合预算有限的AI应用。本地部署:在主机上本地运行模型,即使在网络连接有限的环境中也能使用。更快的处理速度:本地应用程序访问模型,无需等待网络调用,提高用户体验。定制和微调:可以根据应用需求定制模型,加载自定义模型。示例

示例1:嵌入创建

创建嵌入是构建RAG或搜索应用程序的常见任务。以下示例创建一个编程名言的表,并使用pgai扩展通过Ollama的Llama3模型进行嵌入创建。

-- 安装pgai扩展create extension if not exists ai cascade;-- 创建编程名言表create table quotes( id int not null primary key generated by default as identity, quote text, person text, embedding vector(4096) -- 向量数据类型来自pgvector扩展);insert into quotes (quote, person) values ('What one programmer can do in one month, two programmers can do in two months.', 'Frederick P. Brooks'), ('The only way to learn a new programming language is by writing programs in it.', 'Dennis Ritchie'), ('Talk is cheap. Show me the code.', 'Linus Torvalds');-- 使用Ollama的Llama3模型嵌入内容update quotes set embedding = ollama_embed('llama3', format('%s - %s', person, quote));-- 显示每个嵌入的向量维度select quote, person, vector_dims(embedding) from quotes;

示例2:LLM推理

接下来,使用Ollama对存储在PostgreSQL表中的数据进行推理。以下示例创建一个电影表,并要求LLM总结Robert Downey Jr.的职业生涯。

-- 安装必要的Python包pip install pandas pyarrow fsspec huggingface_hub-- 下载电影数据并转换为CSV文件import pandas as pddf = pd.read_parquet("hf://datasets/Cohere/movies/movies.parquet")df.to_csv('movies.csv');-- 创建PostgreSQL表并加载电影数据create extension if not exists ai cascade;create table movie( id int not null primary key, title text not null, overview text, genres text, producer text, "cast" text);\copy movie from 'movies.csv' with (format csv, header on)with career as( -- 构建多行字符串包含电影列表 select string_agg(format(E'title: %s\noverview: %s\ngenres: %s\n', title, overview, genres), E'\n') as movies from movie where "cast" like '%Robert Downey Jr.%')select ollama_generate( 'llama3', concat(E'Robert Downey Jr. was an actor in the movies listed below. Please summarize his career giving an overview of the types of movies he has starred in.\n', career.movies))->>'response'from career;

大语言模型的返回结果是:

Robert Downey Jr. is an American actor known for his versatility in playing a wide range of characters across various genres. His filmography showcases his ability to adapt to different roles and projects.Downey Jr.'s most notable work has been in the Marvel Cinematic Universe (MCU), where he played Tony Stark/Iron Man in multiple films, including:* Iron Man (2008)* The Avengers (2012)* Iron Man 3 (2013)* The Avengers: Age of Ultron (2015)* Captain America: Civil War (2016)* Spider-Man: Homecoming (2017)* Avengers: Infinity War (2018)* Avengers: Endgame (2019)He has also appeared in other notable films, such as:* Sherlock Holmes (2009) and its sequel Sherlock Holmes: A Game of Shadows (2011), where he played the titular character* Tropic Thunder (2008), a comedy film that parodies Hollywood culture* The Judge (2014), a drama film where he plays a lawyer who returns to his hometown for his mother's funeral* The Shaggy Dog (2006), a family-friendly comedy film based on theic Disney franchiseIn addition to these notable roles, Downey Jr. has also appeared in a wide range of smaller films and independent projects, showcasing his ability to take on diverse characters and genres.Throughout his career, Robert Downey Jr. has demonstrated his versatility as an actor, taking on complex characters, and exploring various genres from drama and comedy to action and science fiction.

示例3:文本转SQL

使用pgai和Ollama执行文本到SQL的转换。以下示例使用ollama_generate函数生成SQL更新语句。

select ollama_generate( 'llama3', $$In PostgreSQL version 16, I have a table defined like this:create table quotes( id int not null primary key generated by default as identity, quote text, person text, embedding vector(4096) -- 向量数据类型来自pgvector扩展);There is a function called "ollama_embed" with a signature like this:ollama_embed(_model text, _prompt text) returns vectorI want to embed the contents of the quote column in the quotes table using the ollama_embed function and the "llama3" model.Write an UPDATE statement to set the embedded column of the quotes table to the vector returned from the ollama_embed function.Your response should ONLY include the SQL statement with NO commentary. Make sure the output is valid SQL for PostgreSQL version 16.$$, _system=>'You are an expert PostgreSQL developer who assists in writing SQL queries.')->>'response';

多次运行此程序可能会产生不同的 SQL 语句,其正确性和质量各不相同。但是,它确实为我们提供了正确的答案:

UPDATE quotes SET embedding = ollama_embed('llama3', quote)开始使用Ollama和Pgai

大语言模型(LLMs)与传统数据库的结合,标志着数据分析和推理领域的重大进步。通过将智能推理和自然语言处理能力融入数据库操作,开发者可以更高效地从复杂数据中提取洞察。这种结合不仅提升了数据访问的灵活性和直观性,还使得非技术用户能够通过自然语言查询数据库,从而促进了跨学科的协作与创新。此外,实时的推理能力能够加速决策过程,使企业更快速地响应市场变化,挖掘潜在机会。总之,这一进步为数据驱动的决策提供了更强大的支持,开启了智能化数据管理的新篇章。

github地址:https://github.com/timescale/pgai

0 阅读:54