Building a Real-Time Audio Chat with OpenAI: Integrating Speech Recognition and Text-to-Speech

In this tutorial, we will walk through building a real-time audio chat application with FastAPI and WebSockets. The application lets users converse with a chat assistant by speaking: audio input is converted to text via speech recognition, the resulting text is fed to a GPT model for natural language understanding, and the assistant's reply is delivered back to the user through text-to-speech.

Setting Up the Environment

Before you begin, make sure the necessary dependencies are installed. You can install the required Python packages with:

pip install fastapi uvicorn openai

You will also need an OpenAI API key.
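
The key can be exported as an environment variable so it never needs to appear in source code (replace the placeholder with your own key):

export OPENAI_API_KEY="your-key-here"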

Backend Development with FastAPI and OpenAI

The application is built with FastAPI, a modern, fast web framework for building APIs in Python. It integrates OpenAI's GPT-3.5 Turbo model to generate responses to user queries.

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import FileResponse
import openai
import time
import os

FastAPI Configuration

Create the FastAPI application and set the necessary environment variables, including the OpenAI API key.

app = FastAPI()
os.environ["OPENAI_API_KEY"] = "Your_API_key"
client = openai.OpenAI()
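
Hardcoding the key as above is fine for a quick demo, but a safer variant reads it from the environment set earlier and fails fast if it is missing. A sketch, not from the original post:

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
client = openai.OpenAI(api_key=api_key)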

Connection Manager

Define a ConnectionManager class to manage the active WebSocket connections.

class ConnectionManager:
    def __init__(self):
        self.active_connections = []

    async def connect(self, websocket: WebSocket):
        await websocket.accept()
        self.active_connections.append(websocket)

    def disconnect(self, websocket: WebSocket):
        # Guard against double removal (the endpoint's cleanup may run twice)
        if websocket in self.active_connections:
            self.active_connections.remove(websocket)

    async def send_text(self, text: str, websocket: WebSocket):
        await websocket.send_text(text)

manager = ConnectionManager()

WebSocket Endpoint

The main WebSocket endpoint handles real-time communication with the client. It receives recognized speech as text, streams a model reply, and sends it back sentence by sentence so the client can start speaking before the full response is complete.

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await manager.connect(websocket)
    try:
        while True:
            try:
                # Receive text data (the speech recognition result) from the client
                data = await websocket.receive_text()
                print(f"Received text: {data}")

                request_start = time.time()  # measure latency from this request
                res = call_open_api(data)

                # Iterate through the stream of events
                collected_chunks = []
                collected_messages = []
                for chunk in res:
                    chunk_time = time.time() - request_start  # delay of this chunk
                    collected_chunks.append(chunk)  # save the raw event response
                    chunk_message = chunk.choices[0].delta.content  # extract the message
                    collected_messages.append(chunk_message)

                    # Flush a complete sentence as soon as one appears, so the
                    # client can start text-to-speech before the reply finishes
                    if chunk_message is not None and chunk_message.find('.') != -1:
                        print("Found full stop")
                        full_reply_content = ''.join(
                            m for m in collected_messages if m is not None
                        )
                        await manager.send_text(full_reply_content, websocket)
                        collected_messages = []

                    print(f"Message received {chunk_time:.2f} seconds after request: {chunk_message}")

                # Send whatever is left once the stream ends
                if len(collected_messages) > 0:
                    full_reply_content = ''.join(
                        m for m in collected_messages if m is not None
                    )
                    await manager.send_text(full_reply_content, websocket)

            except WebSocketDisconnect:
                break
            except Exception as e:
                # Handle other exceptions
                print(f"Error: {str(e)}")
                break
    finally:
        manager.disconnect(websocket)
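
The endpoint relies on a call_open_api helper that the original post never defines. Here is a minimal sketch, assuming the streaming Chat Completions API of the openai v1 client created above (the system prompt is an illustrative placeholder):

def call_open_api(prompt: str):
    # Stream the completion so sentences can be forwarded as they arrive;
    # the endpoint reads chunk.choices[0].delta.content from each event.
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant."},
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )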

Serving the HTML Page

An additional endpoint serves the HTML page for client interaction. FileResponse resolves the path relative to the working directory, so voice_frontend.html should sit next to the server script.

@app.get("/")
async def get():
    return FileResponse("voice_frontend.html")
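
Before wiring up the browser, the WebSocket endpoint can be smoke-tested with FastAPI's TestClient (which requires the httpx package). A minimal sketch, assuming the backend above is saved as main.py and a valid API key is configured; note that it makes a real OpenAI call:

from fastapi.testclient import TestClient

from main import app  # assumes the backend above lives in main.py

test_client = TestClient(app)

with test_client.websocket_connect("/ws") as websocket:
    websocket.send_text("Tell me a short fact about the moon.")
    # Print the first sentence the server streams back
    print(websocket.receive_text())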

HTML Structure

The HTML structure defines the page layout: a heading, a "Start Continuous Recognition" button, and a status indicator.

<!DOCTYPE html>
<html>
<head>
    <title>Real-time Speech Recognition and Text-to-Speech</title>
    <style>
        /* Styles for the page */
    </style>
</head>
<body>
    <h1>Audio-only Chat</h1>
    <button id="startButton">Start Continuous Recognition</button>
    <p id="status"></p>
    <!-- JavaScript code for speech recognition and WebSocket communication -->
    <script>
        // JavaScript code
    </script>
</body>
</html>

JavaScript Code

The JavaScript code implements the speech recognition and WebSocket logic. It uses the Web Speech API for both speech recognition and text-to-speech, and opens a WebSocket connection to communicate with the FastAPI server.

const startButton = document.getElementById('startButton');
const status = document.getElementById('status');
let ws;
let recognition;

function startRecognition() {
    if ('webkitSpeechRecognition' in window) {
        recognition = new webkitSpeechRecognition();
        recognition.continuous = true;
        recognition.interimResults = false;

        recognition.onstart = () => {
            status.innerText = 'Speech recognition is on. Speak into the microphone.';
        };

        recognition.onresult = (event) => {
            let transcript = event.results[event.resultIndex][0].transcript;
            // Interrupt any ongoing text-to-speech when the user speaks
            window.speechSynthesis.cancel();
            ws.send(transcript);
        };

        recognition.onerror = (event) => {
            status.innerText = 'Speech recognition error: ' + event.error;
        };

        recognition.onend = () => {
            // Restart only while the socket is open; otherwise stopping
            // recognition on disconnect would immediately restart it
            if (ws && ws.readyState === WebSocket.OPEN) {
                recognition.start();
            }
        };

        recognition.start();
    } else {
        status.innerText = 'Your browser does not support the Web Speech API.';
    }
}

function speakText(text) {
    let speech = new SpeechSynthesisUtterance(text);
    window.speechSynthesis.speak(speech);
}

startButton.onclick = () => {
    // Plain ws:// for a local uvicorn server; use wss:// behind TLS
    ws = new WebSocket('ws://127.0.0.1:8000/ws');
    ws.onopen = () => {
        startRecognition();
    };
    ws.onmessage = (event) => {
        speakText(event.data);
    };
    ws.onerror = (event) => {
        console.error('WebSocket error:', event);
    };
    ws.onclose = () => {
        recognition.stop();
        status.innerText = 'WebSocket disconnected.';
    };
};
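
To try it out, save the backend as, say, main.py (the filename is an assumption) in the same directory as voice_frontend.html, start the server, and open http://127.0.0.1:8000 in a Chromium-based browser, since webkitSpeechRecognition is not available in Firefox:

uvicorn main:app --host 127.0.0.1 --port 8000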

GitHub link: https://github.com/SagarDangal/audio_chat_with_openAI

Conclusion

By combining the power of FastAPI, WebSockets, and OpenAI, you have built a real-time audio chat application. Users can hold a natural spoken conversation with the assistant, which replies with contextually relevant text converted to speech. This tutorial provides a foundation you can extend: explore additional features, refine the user interface, and tailor the assistant's behavior to create a distinctive, engaging experience.
