Building a Real-Time Audio Chat with OpenAI: Integrating Speech Recognition and Text-to-Speech

In this tutorial, we will walk through building a real-time audio chat application with FastAPI and WebSockets. The application lets users converse with a chat assistant by speaking: audio input is converted to text via speech recognition, the resulting text is fed to a GPT model for natural language understanding, and the assistant's reply is delivered back to the user through text-to-speech.

Setting Up the Environment

Before you begin, make sure the necessary dependencies are installed. You can install the required Python packages with:

pip install fastapi uvicorn openai

You will also need an OpenAI API key.
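
The key can be exported as an environment variable so it never needs to appear in source code (replace the placeholder with your own key):

export OPENAI_API_KEY="your-key-here"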

Backend Development with FastAPI and OpenAI

The application is built with FastAPI, a modern, fast web framework for building APIs in Python. It integrates OpenAI's GPT-3.5 Turbo model to generate responses to user queries.

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import FileResponse
import openai
import time
import os

FastAPI Configuration

Create the FastAPI application and set the necessary environment variables, including the OpenAI API key.

app = FastAPI()
os.environ["OPENAI_API_KEY"] = "Your_API_key"
client = openai.OpenAI()
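
Hardcoding the key as above is fine for a quick demo, but a safer variant reads it from the environment set earlier and fails fast if it is missing. A sketch, not from the original post:

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
client = openai.OpenAI(api_key=api_key)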

Connection Manager

Define a ConnectionManager class to manage the active WebSocket connections.

class ConnectionManager:
    def __init__(self):
        self.active_connections = []

    async def connect(self, websocket: WebSocket):
        await websocket.accept()
        self.active_connections.append(websocket)

    def disconnect(self, websocket: WebSocket):
        # Guard against double removal (the endpoint's cleanup may run twice)
        if websocket in self.active_connections:
            self.active_connections.remove(websocket)

    async def send_text(self, text: str, websocket: WebSocket):
        await websocket.send_text(text)

manager = ConnectionManager()

WebSocket Endpoint

The main WebSocket endpoint handles real-time communication with the client. It receives recognized speech as text, streams a model reply, and sends it back sentence by sentence so the client can start speaking before the full response is complete.

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await manager.connect(websocket)
    try:
        while True:
            try:
                # Receive text data (the speech recognition result) from the client
                data = await websocket.receive_text()
                print(f"Received text: {data}")

                request_start = time.time()  # measure latency from this request
                res = call_open_api(data)

                # Iterate through the stream of events
                collected_chunks = []
                collected_messages = []
                for chunk in res:
                    chunk_time = time.time() - request_start  # delay of this chunk
                    collected_chunks.append(chunk)  # save the raw event response
                    chunk_message = chunk.choices[0].delta.content  # extract the message
                    collected_messages.append(chunk_message)

                    # Flush a complete sentence as soon as one appears, so the
                    # client can start text-to-speech before the reply finishes
                    if chunk_message is not None and chunk_message.find('.') != -1:
                        print("Found full stop")
                        full_reply_content = ''.join(
                            m for m in collected_messages if m is not None
                        )
                        await manager.send_text(full_reply_content, websocket)
                        collected_messages = []

                    print(f"Message received {chunk_time:.2f} seconds after request: {chunk_message}")

                # Send whatever is left once the stream ends
                if len(collected_messages) > 0:
                    full_reply_content = ''.join(
                        m for m in collected_messages if m is not None
                    )
                    await manager.send_text(full_reply_content, websocket)

            except WebSocketDisconnect:
                break
            except Exception as e:
                # Handle other exceptions
                print(f"Error: {str(e)}")
                break
    finally:
        manager.disconnect(websocket)
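
The endpoint relies on a call_open_api helper that the original post never defines. Here is a minimal sketch, assuming the streaming Chat Completions API of the openai v1 client created above (the system prompt is an illustrative placeholder):

def call_open_api(prompt: str):
    # Stream the completion so sentences can be forwarded as they arrive;
    # the endpoint reads chunk.choices[0].delta.content from each event.
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant."},
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )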

Serving the HTML Page

An additional endpoint serves the HTML page for client interaction. FileResponse resolves the path relative to the working directory, so voice_frontend.html should sit next to the server script.

@app.get("/")
async def get():
    return FileResponse("voice_frontend.html")
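
Before wiring up the browser, the WebSocket endpoint can be smoke-tested with FastAPI's TestClient (which requires the httpx package). A minimal sketch, assuming the backend above is saved as main.py and a valid API key is configured; note that it makes a real OpenAI call:

from fastapi.testclient import TestClient

from main import app  # assumes the backend above lives in main.py

test_client = TestClient(app)

with test_client.websocket_connect("/ws") as websocket:
    websocket.send_text("Tell me a short fact about the moon.")
    # Print the first sentence the server streams back
    print(websocket.receive_text())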

HTML Structure

The HTML structure defines the page layout: a heading, a "Start Continuous Recognition" button, and a status indicator.

<!DOCTYPE html>
<html>
<head>
    <title>Real-time Speech Recognition and Text-to-Speech</title>
    <style>
        /* Styles for the page */
    </style>
</head>
<body>
    <h1>Audio-only Chat</h1>
    <button id="startButton">Start Continuous Recognition</button>
    <p id="status"></p>
    <!-- JavaScript code for speech recognition and WebSocket communication -->
    <script>
        // JavaScript code
    </script>
</body>
</html>

JavaScript Code

The JavaScript code implements the speech recognition and WebSocket logic. It uses the Web Speech API for both speech recognition and text-to-speech, and opens a WebSocket connection to communicate with the FastAPI server.

const startButton = document.getElementById('startButton');
const status = document.getElementById('status');
let ws;
let recognition;

function startRecognition() {
    if ('webkitSpeechRecognition' in window) {
        recognition = new webkitSpeechRecognition();
        recognition.continuous = true;
        recognition.interimResults = false;

        recognition.onstart = () => {
            status.innerText = 'Speech recognition is on. Speak into the microphone.';
        };

        recognition.onresult = (event) => {
            let transcript = event.results[event.resultIndex][0].transcript;
            // Interrupt any ongoing text-to-speech when the user speaks
            window.speechSynthesis.cancel();
            ws.send(transcript);
        };

        recognition.onerror = (event) => {
            status.innerText = 'Speech recognition error: ' + event.error;
        };

        recognition.onend = () => {
            // Restart only while the socket is open; otherwise stopping
            // recognition on disconnect would immediately restart it
            if (ws && ws.readyState === WebSocket.OPEN) {
                recognition.start();
            }
        };

        recognition.start();
    } else {
        status.innerText = 'Your browser does not support the Web Speech API.';
    }
}

function speakText(text) {
    let speech = new SpeechSynthesisUtterance(text);
    window.speechSynthesis.speak(speech);
}

startButton.onclick = () => {
    // Plain ws:// for a local uvicorn server; use wss:// behind TLS
    ws = new WebSocket('ws://127.0.0.1:8000/ws');
    ws.onopen = () => {
        startRecognition();
    };
    ws.onmessage = (event) => {
        speakText(event.data);
    };
    ws.onerror = (event) => {
        console.error('WebSocket error:', event);
    };
    ws.onclose = () => {
        recognition.stop();
        status.innerText = 'WebSocket disconnected.';
    };
};
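
To try it out, save the backend as, say, main.py (the filename is an assumption) in the same directory as voice_frontend.html, start the server, and open http://127.0.0.1:8000 in a Chromium-based browser, since webkitSpeechRecognition is not available in Firefox:

uvicorn main:app --host 127.0.0.1 --port 8000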

GitHub link: https://github.com/SagarDangal/audio_chat_with_openAI

Conclusion

By combining the power of FastAPI, WebSockets, and OpenAI, you have built a real-time audio chat application. Users can hold a natural spoken conversation with the assistant, which replies with contextually relevant text converted to speech. This tutorial provides a foundation you can extend: explore additional features, refine the user interface, and tailor the assistant's behavior to create a distinctive, engaging experience.
