Introduction
Over the past year, machine learning technology has been developing at a remarkable pace. More and more companies are sharing their best practices, opening up new possibilities for building intelligent digital assistants.
In this article, I want to share my experience of implementing a voice assistant and offer you a few ideas for making it even smarter and more useful.
So, what will our voice assistant be able to do?

| Feature | Works offline | Required libraries |
| --- | --- | --- |
| Recognize and synthesize speech (including an offline mode) | supported | pip install PyAudio (capturing audio from a microphone); pip install pyttsx3 (speech synthesis) |
| Report the weather forecast | not supported | pip install pyowm (OpenWeatherMap) |
| Run a Google search (and open the list of results) | not supported | pip install google |
| Run a video search on YouTube | not supported | -- |
| Look up a definition on Wikipedia | not supported | pip install wikipedia-api |
| Translate phrases from the target language into the user's native language and vice versa | not supported | pip install googletrans (Google Translate) |
| Search for people by first and last name on social networks | not supported | -- |
| "Flip a coin" | supported | -- |
| Greet the user and say goodbye (after the goodbye, the application exits) | supported | -- |
| Change the speech recognition and synthesis language settings at any time | supported | -- |
| TODO: much more... | | |
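If you want to try everything at once, the dependencies from the table (plus a couple used later in this article) can be installed in one go. This is just a sketch: it assumes these are the current package names on PyPI, and that you want both the online and offline features:

pip install PyAudio pyttsx3 SpeechRecognition vosk pyowm google wikipedia-api googletrans termcolor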
Step 1. Handling voice input
Let's start by learning how to handle voice input. We will need a microphone and a couple of installed libraries: PyAudio and SpeechRecognition.
Let's prepare the basic tools for speech recognition:
import speech_recognition

if __name__ == "__main__":
    # initializing the tools for recognizing and reading audio input
    recognizer = speech_recognition.Recognizer()
    microphone = speech_recognition.Microphone()

    while True:
        # start recording speech, followed by printing the recognized text
        voice_input = record_and_recognize_audio()
        print(voice_input)
Now let's create the function for recording and recognizing speech. For online recognition we will need Google, since it offers high recognition quality for a large number of languages.
def record_and_recognize_audio(*args: tuple):
    """
    Recording and recognizing audio from the microphone
    """
    with microphone:
        recognized_data = ""

        # regulating the surrounding noise level
        recognizer.adjust_for_ambient_noise(microphone, duration=2)

        try:
            print("Listening...")
            audio = recognizer.listen(microphone, 5, 5)
        except speech_recognition.WaitTimeoutError:
            print("Can you check if your microphone is on, please?")
            return

        # online recognition via Google
        try:
            print("Started recognition...")
            recognized_data = recognizer.recognize_google(audio, language="ru").lower()
        except speech_recognition.UnknownValueError:
            pass

        # this error is thrown when there are
        # problems with internet access
        except speech_recognition.RequestError:
            print("Check your Internet Connection, please")

        return recognized_data
What if there is no internet access? You can fall back on solutions for offline recognition. I personally really like the Vosk project.
Actually, if you don't need the offline option, you don't have to implement it. I just wanted to show both approaches within the scope of this article, so that you can choose according to your system requirements (for example, Google is the undisputed leader in the number of available recognition languages). Now, with the offline solution implemented and the required language models added to the project, we will automatically switch to offline recognition whenever the network is unavailable.
Note that, in order not to have to repeat the same phrase twice, I decided to record the audio from the microphone into a temporary wav file, which is deleted after each recognition.
So, the resulting code looks like this:
Full working code for speech recognition
from vosk import Model, KaldiRecognizer  # offline speech recognition with Vosk
import speech_recognition  # recognizing user speech (Speech-To-Text)
import wave  # working with wav audio files
import json  # parsing the JSON responses returned by Vosk
import os  # working with the file system
def record_and_recognize_audio(*args: tuple):
    """
    Recording and recognizing audio from the microphone
    """
    with microphone:
        recognized_data = ""

        # regulating the surrounding noise level
        recognizer.adjust_for_ambient_noise(microphone, duration=2)

        try:
            print("Listening...")
            audio = recognizer.listen(microphone, 5, 5)

            with open("microphone-results.wav", "wb") as file:
                file.write(audio.get_wav_data())
        except speech_recognition.WaitTimeoutError:
            print("Can you check if your microphone is on, please?")
            return

        # online recognition via Google
        try:
            print("Started recognition...")
            recognized_data = recognizer.recognize_google(audio, language="ru").lower()
        except speech_recognition.UnknownValueError:
            pass

        # in case of problems with internet access,
        # we attempt offline recognition via Vosk
        except speech_recognition.RequestError:
            print("Trying to use offline recognition...")
            recognized_data = use_offline_recognition()

        return recognized_data
def use_offline_recognition():
    """
    Switching to offline speech recognition
    :return: the recognized phrase
    """
    recognized_data = ""
    try:
        # checking whether the language model exists at the expected path
        if not os.path.exists("models/vosk-model-small-ru-0.4"):
            print("Please download the model from:\n"
                  "https://alphacephei.com/vosk/models and unpack as 'models/vosk-model-small-ru-0.4' in the current folder.")
            exit(1)

        # reading the recorded audio file (choose the model for your language)
        wave_audio_file = wave.open("microphone-results.wav", "rb")
        model = Model("models/vosk-model-small-ru-0.4")
        offline_recognizer = KaldiRecognizer(model, wave_audio_file.getframerate())

        data = wave_audio_file.readframes(wave_audio_file.getnframes())
        if len(data) > 0:
            if offline_recognizer.AcceptWaveform(data):
                recognized_data = offline_recognizer.Result()

                # extracting the recognized text from the JSON string
                # (so that it can be printed and processed later)
                recognized_data = json.loads(recognized_data)
                recognized_data = recognized_data["text"]
    except:
        print("Sorry, speech service is unavailable. Try again later")

    return recognized_data
if __name__ == "__main__":
    # initializing the tools for recognizing and reading audio input
    recognizer = speech_recognition.Recognizer()
    microphone = speech_recognition.Microphone()

    while True:
        # start recording speech, followed by printing the recognized text
        # and deleting the temporary wav file (if it was created)
        voice_input = record_and_recognize_audio()
        if os.path.exists("microphone-results.wav"):
            os.remove("microphone-results.wav")
        print(voice_input)
You may ask: "Why support offline functionality at all?"
In my opinion, it is always worth considering that the user may be disconnected from the network. In that case, the voice assistant can still be useful if you use it as a conversational bot or for solving a number of simple tasks: counting something, recommending a movie, helping to choose a cuisine, playing a game, and so on.
Step 2. Configuring the voice assistant
Since our voice assistant may have a gender, a language and, following tradition, a name, let's allocate a separate class for this data, which we will work with later.
To give the assistant a voice, we will use the offline speech synthesis library pyttsx3. It will automatically find the voices available for synthesis on our computer, depending on the operating system settings (so different voices may be available to you, and you may need different indices).
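To find out which indices the voices have on your machine, you can enumerate them first. A minimal sketch using only the standard pyttsx3 API:

import pyttsx3

engine = pyttsx3.init()
# print the index, identifier and name of every voice the OS provides
for index, voice in enumerate(engine.getProperty("voices")):
    print(index, voice.id, voice.name)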
We will also add the initialization of speech synthesis to the main function, along with a separate playback function. To make sure everything works, let's check whether the user has greeted us, and have the assistant greet them back:
Full code of the voice assistant skeleton (speech synthesis and recognition)
from vosk import Model, KaldiRecognizer  # offline speech recognition with Vosk
import speech_recognition  # recognizing user speech (Speech-To-Text)
import pyttsx3  # speech synthesis (Text-To-Speech)
import wave  # working with wav audio files
import json  # parsing the JSON responses returned by Vosk
import os  # working with the file system
class VoiceAssistant:
    """
    Settings of our voice assistant, including its name, sex,
    speech language and recognition language
    """
    name = ""
    sex = ""
    speech_language = ""
    recognition_language = ""
def setup_assistant_voice():
    """
    Setting the default voice
    (the index may vary depending on
    the operating system settings)
    """
    voices = ttsEngine.getProperty("voices")

    if assistant.speech_language == "en":
        assistant.recognition_language = "en-US"
        if assistant.sex == "female":
            # Microsoft Zira Desktop - English (United States)
            ttsEngine.setProperty("voice", voices[1].id)
        else:
            # Microsoft David Desktop - English (United States)
            ttsEngine.setProperty("voice", voices[2].id)
    else:
        assistant.recognition_language = "ru-RU"
        # Microsoft Irina Desktop - Russian
        ttsEngine.setProperty("voice", voices[0].id)
def play_voice_assistant_speech(text_to_speech):
    """
    Playing back the assistant's speech (without saving the audio)
    :param text_to_speech: the text to be converted to speech
    """
    ttsEngine.say(str(text_to_speech))
    ttsEngine.runAndWait()
def record_and_recognize_audio(*args: tuple):
    """
    Recording and recognizing audio from the microphone
    """
    with microphone:
        recognized_data = ""

        # regulating the surrounding noise level
        recognizer.adjust_for_ambient_noise(microphone, duration=2)

        try:
            print("Listening...")
            audio = recognizer.listen(microphone, 5, 5)

            with open("microphone-results.wav", "wb") as file:
                file.write(audio.get_wav_data())
        except speech_recognition.WaitTimeoutError:
            print("Can you check if your microphone is on, please?")
            return

        # online recognition via Google
        # (high quality for the most common languages)
        try:
            print("Started recognition...")
            recognized_data = recognizer.recognize_google(audio, language="ru").lower()
        except speech_recognition.UnknownValueError:
            pass

        # in case of problems with internet access,
        # we attempt offline recognition via Vosk
        except speech_recognition.RequestError:
            print("Trying to use offline recognition...")
            recognized_data = use_offline_recognition()

        return recognized_data
def use_offline_recognition():
    """
    Switching to offline speech recognition
    :return: the recognized phrase
    """
    recognized_data = ""
    try:
        # checking whether the language model exists at the expected path
        if not os.path.exists("models/vosk-model-small-ru-0.4"):
            print("Please download the model from:\n"
                  "https://alphacephei.com/vosk/models and unpack as 'models/vosk-model-small-ru-0.4' in the current folder.")
            exit(1)

        # reading the recorded audio file (choose the model for your language)
        wave_audio_file = wave.open("microphone-results.wav", "rb")
        model = Model("models/vosk-model-small-ru-0.4")
        offline_recognizer = KaldiRecognizer(model, wave_audio_file.getframerate())

        data = wave_audio_file.readframes(wave_audio_file.getnframes())
        if len(data) > 0:
            if offline_recognizer.AcceptWaveform(data):
                recognized_data = offline_recognizer.Result()

                # extracting the recognized text from the JSON string
                # (so that it can be printed and processed later)
                recognized_data = json.loads(recognized_data)
                recognized_data = recognized_data["text"]
    except:
        print("Sorry, speech service is unavailable. Try again later")

    return recognized_data
if __name__ == "__main__":
    # initializing the tools for recognizing and reading audio input
    recognizer = speech_recognition.Recognizer()
    microphone = speech_recognition.Microphone()

    # initializing the speech synthesis tool
    ttsEngine = pyttsx3.init()

    # configuring the voice assistant's data
    assistant = VoiceAssistant()
    assistant.name = "Alice"
    assistant.sex = "female"
    assistant.speech_language = "ru"

    # setting the default voice
    setup_assistant_voice()

    while True:
        # start recording speech, followed by printing the recognized text
        # and deleting the temporary wav file (if it was created)
        voice_input = record_and_recognize_audio()
        if os.path.exists("microphone-results.wav"):
            os.remove("microphone-results.wav")
        print(voice_input)

        # extracting the command from the recognized phrase (the first word)
        voice_input = voice_input.split(" ")
        command = voice_input[0]
        # example: respond to a greeting (the exact trigger word
        # depends on the recognition language you have chosen)
        if command == "hello":
            play_voice_assistant_speech("greetings to you too")
Honestly, at this point I would love to learn how to write a speech synthesizer myself, but my knowledge is not sufficient for that yet. If you can recommend good literature, a course, or an interesting documented solution that would help me dive deeper into this topic, please write about it in the comments.
Step 3. Handling commands
Now that we have "learned" to recognize and synthesize speech with the help of our colleagues' wonderful work, we can start reinventing the wheel to handle the user's voice commands :D
In my case, I store the command variants for multiple languages together, since I don't have that many events to handle, and I'm satisfied with the accuracy with which one command or another is identified. However, for large projects I would recommend splitting the configuration by language.
I can offer two ways of storing commands.
Option 1
You can use a wonderful JSON-like object in which to store intents, development scenarios and responses for failed attempts (these are often used in chatbots). It looks like this:
config = {
    "intents": {
        "greeting": {
            "examples": ["hello", "good morning"],
            "responses": play_greetings
        },
        "farewell": {
            "examples": ["goodbye", "bye", "see you soon"],
            "responses": play_farewell_and_quit
        },
        "google_search": {
            "examples": ["search on google", "google", "find on google"],
            "responses": search_for_term_on_google
        },
    },
    "failure_phrases": play_failure_phrase
}
This option is suitable for those who want to train the assistant to respond to more complex phrases. Moreover, you can apply NLU methods here and create a feature that predicts the user's intent and checks it against the intents in the configuration.
We will look at this method in detail in Step 5 of this article. In the meantime, let me draw your attention to a simpler option.
Option 2
You can use a simplified dictionary whose keys are tuples (a hashable type, since dictionaries use hashing for fast storage and retrieval of elements) and whose values are the names of the functions to be executed. For short commands, the following option works well:
commands = {
    ("hello", "hi", "morning"): play_greetings,
    ("bye", "goodbye", "quit", "exit", "stop"): play_farewell_and_quit,
    ("search", "google", "find"): search_for_term_on_google,
    ("video", "youtube", "watch"): search_for_video_on_youtube,
    ("wikipedia", "definition", "about"): search_for_definition_on_wikipedia,
    ("translate", "interpretation", "translation"): get_translation,
    ("language",): change_language,
    ("weather", "forecast"): get_weather_forecast,
}
To process this dictionary, we need to add the following code:
def execute_command_with_name(command_name: str, *args: list):
    """
    Executing a given command by its name
    :param command_name: the name of the command to run
    :param args: arguments that will be passed to the command's function
    :return:
    """
    for key in commands.keys():
        if command_name in key:
            commands[key](*args)
        else:
            pass  # print("Command not found")
if __name__ == "__main__":
    # initializing the tools for recognizing and reading audio input
    recognizer = speech_recognition.Recognizer()
    microphone = speech_recognition.Microphone()

    while True:
        # start recording speech, followed by printing the recognized text
        # and deleting the temporary wav file (if it was created)
        voice_input = record_and_recognize_audio()
        if os.path.exists("microphone-results.wav"):
            os.remove("microphone-results.wav")
        print(voice_input)

        # extracting the command (the first word) and its arguments
        voice_input = voice_input.split(" ")
        command = voice_input[0]
        command_options = [str(input_part) for input_part in voice_input[1:len(voice_input)]]
        execute_command_with_name(command, command_options)
Any additional arguments are passed to the function after the command word. That is, if you say the phrase "video cute cats", the command "video" will call the search_for_video_on_youtube() function with the argument "cute cats" and produce the following result:
An example of a function that handles input arguments:
import webbrowser  # opening pages in the default browser

def search_for_video_on_youtube(*args: tuple):
    """
    Searching for a video on YouTube and automatically opening the results page
    :param args: the parts of the search query
    """
    if not args[0]: return
    search_term = " ".join(args[0])
    url = "https://www.youtube.com/results?search_query=" + search_term
    webbrowser.get().open(url)

    # for a more neutral phrasing, part of the spoken response
    # can be moved into the JSON file with translations
    play_voice_assistant_speech("Here is what I found for " + search_term + " on youtube")
That's it! The bot's core functionality is ready. You can then keep improving it in all sorts of ways. An implementation with detailed comments is available on my GitHub.
Below we will look at a few improvements that make our assistant even smarter.
Step 4. Adding multi-language support
To teach our assistant to work with several language models, it is most convenient to organize a small JSON file with a simple structure:
{
    "Can you check if your microphone is on, please?": {
        "ru": "Не могли бы вы проверить, включен ли ваш микрофон?",
        "en": "Can you check if your microphone is on, please?"
    },
    "What did you say again?": {
        "ru": "Что вы сказали?",
        "en": "What did you say again?"
    }
}
In my case, I switch between Russian and English, since speech recognition and speech synthesis models are available to me for both. The language is chosen based on the speech language of the voice assistant itself.
To obtain translations, we can create a separate class with a method that returns the string containing the translation:
from termcolor import colored  # colored console output

class Translation:
    """
    Retrieving the translations embedded in the application
    to create a multilingual assistant
    """
    with open("translations.json", "r", encoding="UTF-8") as file:
        translations = json.load(file)

    def get(self, text: str):
        """
        Getting the translation of a phrase into the language
        chosen for the assistant's speech
        :param text: the phrase to translate
        :return: the translated phrase
        """
        if text in self.translations:
            return self.translations[text][assistant.speech_language]
        else:
            # in case the translation does not exist yet,
            # output the phrase as-is and log the problem
            print(colored("Not translated phrase: {}".format(text), "red"))
            return text
In the main function, before the loop, we declare our translator as follows: translator = Translation()
Now, when playing back the assistant's speech, we can obtain the translation like this:
play_voice_assistant_speech(translator.get(
"Here is what I found for {} on Wikipedia").format(search_term))
As you can see from the example above, this approach works even for strings that require additional arguments to be inserted. In this way, you can translate the assistant's "standard" sets of phrases.
Step 5. A little machine learning
Now let's return to the JSON object for storing multi-word commands, typical of most chatbots, which I mentioned in Step 3. It suits those who don't want to use strict commands and plan to broaden the understanding of the user's intents with NLU methods.
Roughly speaking, in this case the phrases "good afternoon", "good evening" and "good morning" will be treated as equivalent. The assistant will understand that in all three cases the user's intent was to greet their voice assistant.
With this method you can also create conversational bots for chats, or a conversational mode for your voice assistant (for the cases when you need someone to talk to).
To implement this capability, we need to add a few functions:
def prepare_corpus():
    """
    Preparing the model for guessing the user's intent
    """
    corpus = []
    target_vector = []
    for intent_name, intent_data in config["intents"].items():
        for example in intent_data["examples"]:
            corpus.append(example)
            target_vector.append(intent_name)

    training_vector = vectorizer.fit_transform(corpus)
    classifier_probability.fit(training_vector, target_vector)
    classifier.fit(training_vector, target_vector)

def get_intent(request):
    """
    Getting the most likely intent for the user's request
    :param request: the user's request
    :return: the most likely intent, or None if the classifier is not confident enough
    """
    best_intent = classifier.predict(vectorizer.transform([request]))[0]

    index_of_best_intent = list(classifier_probability.classes_).index(best_intent)
    probabilities = classifier_probability.predict_proba(vectorizer.transform([request]))[0]
    best_intent_probability = probabilities[index_of_best_intent]

    # return the intent only if the classifier is confident enough
    if best_intent_probability > 0.57:
        return best_intent
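Assuming the vectorizer and classifiers have been initialized (as shown in the next snippet) and prepare_corpus() has been called, a quick spot check might look like this. The outputs depend on your example phrases and the 0.57 threshold, so treat the expected values as illustrations rather than guarantees:

print(get_intent("good morning"))  # likely "greeting"
print(get_intent("see you soon"))  # likely "farewell"
print(get_intent("xyzzy"))         # likely None: the classifier is not confident enough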
We also need to slightly modify the main function: add the initialization of the variables used to prepare the model, and change the loop to the version matching the new configuration:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from termcolor import colored

# preparing the corpus for recognizing user requests with some degree of probability
# (the analysis is performed on character n-grams of length 2-3)
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
classifier_probability = LogisticRegression()
classifier = LinearSVC()
prepare_corpus()

while True:
    # start recording speech, followed by printing the recognized text
    # and deleting the temporary wav file (if it was created)
    voice_input = record_and_recognize_audio()
    if os.path.exists("microphone-results.wav"):
        os.remove("microphone-results.wav")
    print(colored(voice_input, "blue"))

    # separating the command from the additional information (arguments)
    if voice_input:
        voice_input_parts = voice_input.split(" ")

        # if a single word was spoken, run the command right away
        # without any additional arguments
        if len(voice_input_parts) == 1:
            intent = get_intent(voice_input)
            if intent:
                config["intents"][intent]["responses"]()
            else:
                config["failure_phrases"]()

        # if the phrase is longer, try to extract the command
        # by checking longer and longer prefixes of the phrase,
        # passing the remaining words as arguments
        if len(voice_input_parts) > 1:
            for guess in range(len(voice_input_parts)):
                intent = get_intent((" ".join(voice_input_parts[0:guess])).strip())
                if intent:
                    command_options = [voice_input_parts[guess:len(voice_input_parts)]]
                    config["intents"][intent]["responses"](*command_options)
                    break
                if not intent and guess == len(voice_input_parts)-1:
                    config["failure_phrases"]()
However, this method is harder to keep under control: it requires constantly verifying that this or that phrase is still correctly attributed by the system to a particular intent. Therefore, it should be used with care (or you should experiment with the model itself).
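One way to keep an eye on this is a crude self-consistency check: every training example should at least map back to its own intent. A minimal sketch (not a substitute for a proper held-out evaluation):

def validate_intent_recognition():
    """
    Checking that every example phrase from the config
    is still classified as its own intent
    """
    for intent_name, intent_data in config["intents"].items():
        for example in intent_data["examples"]:
            predicted = get_intent(example)
            if predicted != intent_name:
                print("Mismatch: '{}' -> {} (expected: {})".format(example, predicted, intent_name))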
Conclusion
This is where my short tutorial ends.
I would be very glad if you share, in the comments, well-commented open-source solutions that could be incorporated into this project, as well as your ideas about what other online and offline functionality could be implemented.
The documented sources of my voice assistant, in two versions, can be found here.
P.S.: The solution works on Windows, Linux and macOS, with minor differences when installing the PyAudio and Google libraries.
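For reference, the PyAudio nuances usually come down to installing the PortAudio headers first. A sketch of the typical steps; the exact package names vary by system and are given here as an assumption, not a guarantee:

# Debian/Ubuntu
sudo apt-get install portaudio19-dev
pip install PyAudio

# macOS
brew install portaudio
pip install PyAudio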