A New Way to Play Super Mario with Pose and Voice

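About the PaddlePaddle Hackathon

The 2021 PaddlePaddle Hackathon is a deep learning programming event for developers worldwide. It is hosted by PaddlePaddle together with the National Engineering Laboratory for Deep Learning Technology and Application, and co-presented with open-source projects including OpenVINO, MLFlow, KubeFlow, and TVM. Its goal is to encourage developers to learn about and take part in open-source deep learning projects.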

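About the Project

This project builds a simple, practical new form of human-computer interaction on top of a pose estimation model and a speech keyword classification model.

The demo uses a PyGame implementation of Super Mario (other games would work just as well, if you want to try). A pose estimation model extracts geometric and motion features, which are translated into body-pose commands. Playing this way takes a fair amount of physical effort, so it doubles as exercise. When you get tired, you can switch to voice mode, which makes the interaction feel more natural.

There is plenty of room to build on this project: practicing a foreign language, a fitness app, a touch of the metaverse with PaddleGAN, playing with online friends on real devices, and so on.

GitHub repository for this project: https://github.com/thunder95/Play_Mario_With_PaddlePaddle

Note: the code was written from scratch during the two-day competition and still has many rough edges. The project is being improved continuously; suggestions and discussion are welcome.

Demo video on Bilibili: https://www.bilibili.com/video/BV1B64y1i7GM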

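Functional Modules

Super Mario Game

A game full of childhood memories. It has already been faithfully recreated with PyGame on GitHub, and that author has implemented it up to level 4.

GitHub repository: https://github.com/justinmeister/Mario-Level-1

This project makes only small changes to the interaction layer. The original game listens for key presses through PyGame; here, commands produced by the other modules are placed in a queue and consumed in place of key events.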
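The queue idea can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the command names, `push_command`, and `poll_commands` are all hypothetical, and the sketch leaves PyGame out so it can run on its own.

```python
import queue

# Hypothetical command names; the real project may use different ones.
COMMANDS = {"left", "right", "down", "jump", "run", "fire", "stop"}

command_queue = queue.Queue()


def push_command(name):
    """Called by the pose or speech module when it recognizes a command."""
    if name not in COMMANDS:
        raise ValueError(f"unknown command: {name}")
    command_queue.put(name)


def poll_commands(held):
    """Drain the queue once per game frame and update the held actions.

    `held` plays the role of the pressed-key state that the game loop
    would otherwise read from the keyboard.
    """
    while True:
        try:
            name = command_queue.get_nowait()
        except queue.Empty:
            break
        if name == "stop":
            held.clear()
            continue
        # Opposite directions cancel each other, like releasing a key.
        if name == "left":
            held.discard("right")
        elif name == "right":
            held.discard("left")
        held.add(name)
    return held
```

In a game loop, `poll_commands(held)` would be called once per frame and the resulting set checked wherever the original code checked pressed keys. A thread-safe `queue.Queue` fits here because the pose and speech modules are likely to run in their own threads.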

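Human Keypoint Estimation

Human-computer interaction demands low-latency inference. After evaluating several models, the project settled on PicoDet-S-Pedestrian and PP-TinyPose, both open-sourced in PaddleDetection. Inference takes about 20 ms per frame, and both speed and accuracy meet the requirements.

PP-TinyPose is PaddleDetection's real-time pose estimation model optimized for mobile devices; it can run multi-person pose estimation smoothly on them. Building on PicoDet, PaddleDetection's own lightweight detection model, the PaddleDetection team also provides a lightweight pedestrian-specific detection model.

PP-TinyPose: https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.3/configs/keypoint/tiny_pose

Because an additional action recognition model would add latency to each command, this project simply classifies the detected keypoints with rules on their coordinates, which is largely sufficient.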
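A coordinate-based classifier of this kind might look like the sketch below. It assumes the pose model returns 17 keypoints in COCO order as `(x, y, score)` in image coordinates. The specific rules, the thresholds, and the function name `classify_pose` are invented for illustration and are not taken from the project.

```python
# COCO keypoint indices (assumed order of the pose model's output).
L_SHOULDER, R_SHOULDER = 5, 6
L_WRIST, R_WRIST = 9, 10
L_HIP, R_HIP, L_KNEE, R_KNEE = 11, 12, 13, 14
USED = (L_SHOULDER, R_SHOULDER, L_WRIST, R_WRIST,
        L_HIP, R_HIP, L_KNEE, R_KNEE)


def classify_pose(keypoints, min_score=0.3):
    """Map one person's keypoints to a command using coordinate rules.

    keypoints: 17 (x, y, score) entries; y grows downward in the image.
    The rules and thresholds here are illustrative only.
    """
    if any(keypoints[i][2] < min_score for i in USED):
        return "none"  # pose too uncertain to act on

    def y(i):
        return keypoints[i][1]

    shoulder_y = (y(L_SHOULDER) + y(R_SHOULDER)) / 2
    hip_y = (y(L_HIP) + y(R_HIP)) / 2
    knee_y = (y(L_KNEE) + y(R_KNEE)) / 2
    torso = hip_y - shoulder_y

    # Squat: hips drop to within a fraction of the torso length of the knees.
    if torso > 0 and knee_y - hip_y < 0.4 * torso:
        return "down"

    left_up = y(L_WRIST) < y(L_SHOULDER)
    right_up = y(R_WRIST) < y(R_SHOULDER)
    if left_up and right_up:
        return "jump"
    if left_up:
        return "left"
    if right_up:
        return "right"
    return "stop"
```

Rules like these cost a handful of comparisons per frame, so they add essentially nothing to the roughly 20 ms spent on detection and pose estimation. Expressing thresholds relative to the torso length keeps them independent of how far the player stands from the camera.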

In [ ]

!git clone https://github.com/PaddlePaddle/PaddleDetection.git
%cd PaddleDetection
!python3 deploy/python/det_keypoint_unite_infer.py --det_model_dir=output_inference/picodet_s_192_pedestrian --keypoint_model_dir=output_inference/tinypose_128x96 --image_file=demo/000000014439.jpg --device=GPU

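Speech Classification Training

Collecting Speech Samples

AI Studio does not currently support online recording, so download the code and run it locally:

!python speech_cmd_cls/generate_data.py

The recording script uses the third-party PyAudio library to record sound automatically. Only the player's seven keywords need to be recorded (左, 右, 下, 停, 跑, 跳, 打: left, right, down, stop, run, jump, hit). The audio is cut into 500 ms segments and saved to the directory for each keyword. About 2 to 3 minutes of recording per keyword is enough; with more time, you can add samples as needed.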
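The 500 ms cutting step can be sketched as below. `generate_data.py` is not shown in this article, so the function name `split_into_clips` and the 16 kHz sample rate in the usage note are assumptions for illustration.

```python
def split_into_clips(samples, sample_rate, clip_ms=500):
    """Cut a mono recording into consecutive fixed-length clips.

    samples: sequence of audio samples. A trailing remainder shorter
    than one clip is dropped so that every clip has the same length.
    """
    clip_len = sample_rate * clip_ms // 1000
    if clip_len <= 0:
        raise ValueError("clip length must be at least one sample")
    n_clips = len(samples) // clip_len
    return [samples[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]
```

At an assumed 16 kHz, each clip holds 8000 samples, so two minutes of recording yields 240 clips per keyword. Fixed-length clips matter because the classifier expects inputs of a single fixed shape.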

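Cleaning the Speech Data

Clips that are silent, contain electrical hum, or sound unclear should be moved to the eighth directory (named 其他, "other").

Preprocessing the Speech Data

The preprocessing script uses the third-party library librosa to load each audio file, extract mel-spectrogram features, and filter out low-decibel audio frames.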
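As a rough sketch of what such a step could look like (`preprocess.py` itself is not shown here): the sample rate, the -60 dB threshold, and both function names are assumptions, and `n_mels=80` is chosen only because it matches the LSTM input size of the model defined later.

```python
def extract_log_mel(path, sample_rate=16000, n_mels=80):
    """Load an audio file and return its log-mel spectrogram.

    Returns an array of shape (n_mels, n_frames), in dB relative to the
    loudest bin. sample_rate and n_mels are assumed values.
    """
    # Imported here so the filtering helper below has no third-party
    # dependency of its own.
    import librosa
    import numpy as np

    audio, sr = librosa.load(path, sr=sample_rate)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)


def drop_quiet_frames(frames_db, threshold_db=-60.0):
    """Keep only frames whose loudest mel bin exceeds threshold_db.

    frames_db: iterable of frames, each a sequence of dB values
    (for the array returned above, pass log_mel.T).
    """
    return [frame for frame in frames_db if max(frame) > threshold_db]
```

A typical call would be `drop_quiet_frames(extract_log_mel("clip.wav").T)`, where `clip.wav` is a placeholder path. Dropping quiet frames changes the number of frames per clip, so a real pipeline would still need to pad or crop to the fixed frame count the network expects.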

!python speech_cmd_cls/preprocess.py

Note: speech_cmd_cls/data contains the author's own recordings, included so that readers can test the pipeline.

In [ ]

# Data preprocessing
!unzip speech_cmd_cls.zip
%cd speech_cmd_cls/
!python preprocess.py

/home/aistudio/speech_cmd_cls
标签名： ['左', '右', '下', '停', '跑', '跳', '打', '其它']
preprocess data finished

In [ ]

# A simple custom LSTM network with attention
import paddle
from paddle import nn


class SpeechCommandModel(nn.Layer):
    def __init__(self, num_classes=10):
        super(SpeechCommandModel, self).__init__()
        self.conv1 = nn.Conv2D(126, 10, (5, 1), padding="SAME")
        self.relu1 = nn.ReLU()
        self.bn1 = nn.BatchNorm2D(10)
        self.conv2 = nn.Conv2D(10, 1, (5, 1), padding="SAME")
        self.relu2 = nn.ReLU()
        self.bn2 = nn.BatchNorm2D(1)
        self.lstm1 = nn.LSTM(input_size=80,
                             hidden_size=64,
                             direction="bidirect")
        self.lstm2 = nn.LSTM(input_size=128,
                             hidden_size=64,
                             direction="bidirect")
        self.query = nn.Linear(128, 128)
        self.softmax = nn.Softmax(axis=-1)
        self.fc1 = nn.Linear(128, 64)
        self.fc1_relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 32)
        self.classifier = nn.Linear(32, num_classes)
        self.cls_softmax = nn.Softmax(axis=-1)

    def forward(self, x):
        # x: [batch, 126, 80, 1]; the conv layers treat 126 as channels
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.bn1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.bn2(x)              # [batch, 1, 80, 1]
        x = x.squeeze(axis=-1)       # [batch, 1, 80]
        x, _ = self.lstm1(x)         # [batch, 1, 128]
        x, _ = self.lstm2(x)         # [batch, 1, 128]
        x = x.squeeze(axis=1)        # [batch, 128]
        q = self.query(x)
        attScores = paddle.matmul(q, x, transpose_y=True)  # [batch, batch]
        attScores = self.softmax(attScores)
        attVector = paddle.matmul(attScores, x)            # [batch, 128]
        output = self.fc1(attVector)
        output = self.fc1_relu(output)
        output = self.fc2(output)
        output = self.classifier(output)
        output = self.cls_softmax(output)
        return output


model = SpeechCommandModel(num_classes=8)
print(model)

SpeechCommandModel(
  (conv1): Conv2D(126, 10, kernel_size=[5, 1], padding=SAME, data_format=NCHW)
  (relu1): ReLU()
  (bn1): BatchNorm2D(num_features=10, momentum=0.9, epsilon=1e-05)
  (conv2): Conv2D(10, 1, kernel_size=[5, 1], padding=SAME, data_format=NCHW)
  (relu2): ReLU()
  (bn2): BatchNorm2D(num_features=1, momentum=0.9, epsilon=1e-05)
  (lstm1): LSTM(80, 64
    (0): BiRNN(
      (cell_fw): LSTMCell(80, 64)
      (cell_bw): LSTMCell(80, 64)
    )
  )
  (lstm2): LSTM(128, 64
    (0): BiRNN(
      (cell_fw): LSTMCell(128, 64)
      (cell_bw): LSTMCell(128, 64)
    )
  )
  (query): Linear(in_features=128, out_features=128, dtype=float32)
  (softmax): Softmax(axis=-1)
  (fc1): Linear(in_features=128, out_features=64, dtype=float32)
  (fc1_relu): ReLU()
  (fc2): Linear(in_features=64, out_features=32, dtype=float32)
  (classifier): Linear(in_features=32, out_features=8, dtype=float32)
  (cls_softmax): Softmax(axis=-1)
)

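Model Training

The speech network is trained with PaddlePaddle's high-level API and reaches a training accuracy of about 95%.

The network is small enough that training is fast in PaddlePaddle even without a GPU.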

!python speech_cmd_cls/train.py

In [18]

!python train.py

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/setuptools/depends.py:2: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/20
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
  "When training, we now always track global mean and variance.")
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9538 - 17ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.6995 - acc: 0.9657 - 6ms/step        
Eval samples: 175
Epoch 2/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9551 - 16ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.5585 - acc: 0.9714 - 6ms/step         
Eval samples: 175
Epoch 3/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9525 - 16ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.4175 - acc: 0.9771 - 6ms/step        
Eval samples: 175
Epoch 4/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9564 - 14ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.5593 - acc: 0.9714 - 6ms/step         
Eval samples: 175
Epoch 5/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9538 - 13ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.3246 - acc: 0.9714 - 5ms/step         
Eval samples: 175
Epoch 6/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9447 - 14ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.5576 - acc: 0.9714 - 6ms/step        
Eval samples: 175
Epoch 7/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9460 - 14ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.4488 - acc: 0.9714 - 6ms/step        
Eval samples: 175
Epoch 8/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9525 - 15ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.7026 - acc: 0.9429 - 6ms/step         
Eval samples: 175
Epoch 9/20
step 193/193 [==============================] - loss: 1.7740 - acc: 0.9389 - 15ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.7024 - acc: 0.9486 - 6ms/step        
Eval samples: 175
Epoch 10/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9460 - 14ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.5597 - acc: 0.9543 - 6ms/step        
Eval samples: 175
Epoch 11/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9467 - 15ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.5596 - acc: 0.9657 - 6ms/step         
Eval samples: 175
Epoch 12/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9506 - 14ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.5625 - acc: 0.9714 - 6ms/step         
Eval samples: 175
Epoch 13/20
step 193/193 [==============================] - loss: 1.7740 - acc: 0.9571 - 14ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.5593 - acc: 0.9657 - 6ms/step        
Eval samples: 175
Epoch 14/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9525 - 14ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.6989 - acc: 0.9600 - 6ms/step        
Eval samples: 175
Epoch 15/20
step 193/193 [==============================] - loss: 1.7740 - acc: 0.9512 - 14ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.8454 - acc: 0.9543 - 6ms/step        
Eval samples: 175
Epoch 16/20
step 193/193 [==============================] - loss: 1.7740 - acc: 0.9473 - 15ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.7026 - acc: 0.9543 - 6ms/step         
Eval samples: 175
Epoch 17/20
step 193/193 [==============================] - loss: 1.2741 - acc: 0.9519 - 15ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.3661 - acc: 0.9771 - 6ms/step         
Eval samples: 175
Epoch 18/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9590 - 15ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.4335 - acc: 0.9714 - 6ms/step         
Eval samples: 175
Epoch 19/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9590 - 14ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.6870 - acc: 0.9657 - 6ms/step         
Eval samples: 175
Epoch 20/20
step 193/193 [==============================] - loss: 1.2740 - acc: 0.9545 - 15ms/step        
Eval begin...
step 22/22 [==============================] - loss: 1.6629 - acc: 0.9486 - 6ms/step         
Eval samples: 175
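In the log above, the training loss sits at 1.2740 in most epochs and occasionally reads 1.7740, while accuracy stays near 95%. One possible explanation, offered as an assumption because `train.py` is not shown: the network already ends in a softmax layer, and if the loss is a softmax cross-entropy that applies softmax a second time, the loss has a floor that even a perfect prediction cannot go below. The sketch below computes that floor for 8 classes; the helper names are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_on_probs(probs, label):
    """Cross-entropy when softmax is applied again to softmax output."""
    return -math.log(softmax(probs)[label])


num_classes = 8
# Model output for a fully confident prediction of class 0.
confident = [1.0] + [0.0] * (num_classes - 1)

right = cross_entropy_on_probs(confident, label=0)
wrong = cross_entropy_on_probs(confident, label=1)

print(f"confident and correct: {right:.4f}")                   # 1.2740
print(f"confident and wrong:   {wrong:.4f}")                   # 2.2740
print(f"batch with half wrong: {(right + wrong) / 2:.4f}")     # 1.7740
```

Under this assumption, 1.2740 would mean the last batch of the epoch was classified correctly with full confidence, and 1.7740 would match a last batch in which half the samples were confidently wrong. The log itself states that the printed loss is that of the current step, which is consistent with this reading.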

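Model Evaluation and Prediction

After training, you can run a first evaluation of the model, or validate it locally in real time with a microphone.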

!python speech_cmd_cls/eval.py

!python speech_cmd_cls/realtime_infer.py

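Important: even a model that scores well on the validation set generalizes relatively poorly with a network and dataset this small. Results may degrade when you change the device, the speaker, or the background noise.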

In [20]

!python eval.py

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/setuptools/depends.py:2: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Eval begin...
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
step 3/3 - loss: 1.3763 - acc: 0.9543 - 27ms/step
Eval samples: 175
{'loss': [1.3763338], 'acc': 0.9542857142857143}
