Vlog

misaraty · filed under Miscellany

2026-04-28 · about 1,802 words · estimated reading time 4 minutes

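Preface

This post collects resources for editing Vlogs, with a preference for royalty-free material. They can be downloaded here.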

LUT

Background music

Pixabay

ASR/TTS

VibeVoice

VibeVoice ✅

FunASR

FunASR ✅

(base) ubuntu@ubuntu:~$ conda create -n funasr python=3.10 -y

/home/anaconda3/lib/python3.13/site-packages/urllib3/connectionpool.py:1097:
InsecureRequestWarning: Unverified HTTPS request is being made to host '127.0.0.1'.
Adding certificate verification is strongly advised.
See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings

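Conda permission problem

The current user is ubuntu, but some cache files under /home/anaconda3/pkgs/cache/ belong to another user (probably because conda was previously run as root). conda therefore cannot write to the cache, and creating the environment fails.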
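
Before changing ownership, it can help to confirm which entries actually belong to someone else. The following is a minimal sketch, assuming a Linux system; the helper name `find_foreign_files` and the `limit` parameter are made up for illustration and are not part of conda.

```python
import os


def find_foreign_files(root, uid=None, limit=20):
    """Return up to `limit` paths under `root` that are not owned by `uid`."""
    uid = os.getuid() if uid is None else uid
    foreign = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                owner = os.lstat(path).st_uid
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if owner != uid:
                foreign.append(path)
                if len(foreign) >= limit:
                    return foreign
    return foreign
```

Calling `find_foreign_files("/home/anaconda3/pkgs/cache")` as the ubuntu user would list the cache entries owned by another account; an empty list suggests ownership is not the cause.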

Fix Anaconda permissions

sudo chown -R ubuntu:ubuntu /home/anaconda3
chmod -R u+w /home/anaconda3

Clean the Conda cache

conda clean -a -y

Create the FunASR environment

conda create -n funasr python=3.10 -y
conda activate funasr

Install FunASR and its dependencies

pip install funasr
pip install modelscope
pip install addict datasets simplejson sortedcontainers
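
After the installs finish, a quick check that the packages are visible to the active environment can save a failed run later. This is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    wanted = ["funasr", "modelscope", "addict", "datasets",
              "simplejson", "sortedcontainers"]
    print("missing:", missing_modules(wanted) or "none")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.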

Install FFmpeg

sudo apt update
sudo apt install ffmpeg -y

This step may break the NVIDIA driver, after which nvidia-smi stops working.
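
To see whether the machine is affected, one option is to run both tools and look at their exit status. The sketch below assumes that a failing nvidia-smi exits with a non-zero code; the helper name `command_works` is made up for illustration.

```python
import subprocess


def command_works(cmd, timeout=30):
    """Return True if `cmd` can be started and exits with status 0."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # not installed, not executable, or hung
    return result.returncode == 0
```

For example, `command_works(["ffmpeg", "-version"])` and `command_works(["nvidia-smi"])` can be checked right after the install; if the second one returns False, continue with the driver repair below.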

Repair the NVIDIA driver

Install DKMS and the build toolchain

sudo apt update
sudo apt install dkms build-essential linux-headers-$(uname -r) -y
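
DKMS builds the kernel module against the headers of the running kernel, so it is worth confirming that they are present. A minimal sketch, assuming the usual Linux layout where /lib/modules/<release>/build points at the headers; the function names are hypothetical.

```python
import os
import platform


def kernel_headers_dir(release=None):
    """Return the conventional build directory for a kernel release."""
    release = release or platform.release()  # same value as `uname -r`
    return os.path.join("/lib/modules", release, "build")


def headers_installed(release=None):
    """Return True if the build directory for the kernel exists."""
    return os.path.isdir(kernel_headers_dir(release))
```

If `headers_installed()` returns False after the install, the headers package for the running kernel is probably still missing.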

Remove the old NVIDIA driver (important)

sudo apt purge '^nvidia-.*' -y
sudo apt autoremove -y

Check the recommended driver

ubuntu-drivers devices

The output looks similar to:

driver : nvidia-driver-595-open - distro non-free recommended
driver : xserver-xorg-video-nouveau - distro free builtin
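
The recommended package can also be picked out of this output programmatically. The sketch below assumes the line format shown above, where the recommended entry ends with the word "recommended"; the helper name `recommended_driver` is made up for illustration.

```python
import re


def recommended_driver(output):
    """Return the package name on the line marked 'recommended', or None."""
    for line in output.splitlines():
        match = re.search(r"driver\s*:\s*(\S+)", line)
        if match and line.rstrip().endswith("recommended"):
            return match.group(1)
    return None


sample = """\
driver : nvidia-driver-595-open - distro non-free recommended
driver : xserver-xorg-video-nouveau - distro free builtin
"""
print(recommended_driver(sample))  # nvidia-driver-595-open
```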

Install the recommended driver

sudo apt install nvidia-driver-595-open -y

Reboot the system

sudo reboot

Verify the NVIDIA driver

nvidia-smi

Example output:

Sun May 17 10:47:22 2026

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:02:00.0  On |                  N/A |
|  0%   48C    P8             32W /  600W |     449MiB /  32607MiB |      2%      Default |
+-----------------------------------------+------------------------+----------------------+
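
The header line carries the two values that matter most here: the installed driver version, and the highest CUDA version that driver supports (not necessarily an installed toolkit). A small sketch for extracting them, assuming the header format shown above; the helper name `parse_smi_header` is hypothetical.

```python
import re


def parse_smi_header(text):
    """Extract driver and CUDA versions from nvidia-smi output, or None."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        return None
    return {"driver": match.group(1), "cuda": match.group(2)}


header = ("| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03"
          "      CUDA Version: 13.2     |")
print(parse_smi_header(header))  # {'driver': '595.58.03', 'cuda': '13.2'}
```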

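Run FunASR

The script below builds on the official FunASR repository to transcribe Chinese speech. It reads an audio or video file, uses FFmpeg to extract the audio and convert it to 16 kHz mono, then runs FunASR's paraformer-zh, VAD, and punctuation models to produce the transcript. Using the per-character timestamps, it writes the result in three formats: txt, srt, and vtt. It supports GPU or CPU inference, hotword boosting, and configurable limits on subtitle length and duration, and it breaks subtitles at punctuation marks. Typical uses include transcribing meeting recordings, generating video subtitles, and captioning course recordings.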

import os
import subprocess
from funasr import AutoModel


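# =========================
# Fixed configuration
# =========================

INPUT_FILE = "a.m4a"

OUTPUT_PREFIX = None  # None derives the prefix from the input name, e.g. a.m4a -> a.srt / a.vtt / a.txt

DEVICE = "cuda:0"  # change to "cpu" to run without a GPU

MAX_CHARS = 18  # maximum number of characters per subtitle cue

MAX_DURATION = 5000  # maximum cue duration in milliseconds

HOTWORD = ""  # separate multiple hotwords with spaces, e.g. "亚洲 欧洲"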


def ms_to_srt_time(ms):
    ms = int(ms)
    h = ms // 3600000
    ms %= 3600000
    m = ms // 60000
    ms %= 60000
    s = ms // 1000
    ms %= 1000
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def ms_to_vtt_time(ms):
    return ms_to_srt_time(ms).replace(",", ".")


def extract_audio(input_path, wav_path):
    cmd = [
        "ffmpeg", "-y",
        "-i", input_path,
        "-vn",
        "-ac", "1",
        "-ar", "16000",
        wav_path,
    ]
    subprocess.run(cmd, check=True)


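# Punctuation marks that close a subtitle cue.
PUNCTS = "。！？!?；;，,、"


def split_text_by_timestamp(text, timestamps):
    # Assumption: FunASR returns one [start, end] pair per recognized character,
    # and the marks added by the punctuation model have no pair of their own.
    # Attach each mark to the preceding character so the indices stay aligned.
    items = []
    idx = 0

    for ch in text.replace(" ", ""):
        if ch in PUNCTS:
            if items:
                items[-1]["text"] += ch
            continue
        if idx >= len(timestamps):
            break
        items.append({
            "text": ch,
            "start": timestamps[idx][0],
            "end": timestamps[idx][1],
        })
        idx += 1

    return items


def merge_chars_to_subtitles(items, max_chars=18, max_duration=5000):
    subtitles = []
    buf = []
    start = None
    end = None

    for item in items:
        if start is None:
            start = item["start"]

        buf.append(item["text"])
        end = item["end"]

        text_now = "".join(buf)
        duration = end - start

        # Cut when the cue is long enough, lasts long enough, or ends a clause.
        should_cut = (
            len(text_now) >= max_chars
            or duration >= max_duration
            or item["text"][-1] in PUNCTS
        )

        if should_cut:
            subtitles.append((start, end, text_now))
            buf = []
            start = None
            end = None

    # Flush whatever is left after the last cut.
    if buf:
        subtitles.append((start, end, "".join(buf)))

    return subtitles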


def write_srt(subtitles, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        for i, (start, end, text) in enumerate(subtitles, 1):
            f.write(f"{i}\n")
            f.write(f"{ms_to_srt_time(start)} --> {ms_to_srt_time(end)}\n")
            f.write(f"{text}\n\n")


def write_vtt(subtitles, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for start, end, text in subtitles:
            f.write(f"{ms_to_vtt_time(start)} --> {ms_to_vtt_time(end)}\n")
            f.write(f"{text}\n\n")


def write_txt(text, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text.replace(" ", "") + "\n")


def main():
    input_path = INPUT_FILE

    if not os.path.exists(input_path):
        raise FileNotFoundError(f"Input file not found: {input_path}")

    base = OUTPUT_PREFIX or os.path.splitext(input_path)[0]

    audio_exts = [".wav", ".mp3", ".flac", ".m4a", ".aac", ".ogg"]
    video_exts = [".mp4", ".mkv", ".avi", ".mov", ".flv", ".wmv", ".webm"]

    ext = os.path.splitext(input_path)[1].lower()

    if ext in video_exts:
        wav_path = base + "_16k.wav"
        print(f"Video file detected, extracting audio: {wav_path}")
        extract_audio(input_path, wav_path)
    elif ext in audio_exts:
        wav_path = base + "_16k.wav"
        print(f"Audio file detected, converting to 16 kHz mono WAV: {wav_path}")
        extract_audio(input_path, wav_path)
    else:
        raise ValueError(f"Unsupported file format: {ext}")

    model = AutoModel(
        model="paraformer-zh",
        model_revision="v2.0.4",
        vad_model="fsmn-vad",
        vad_model_revision="v2.0.4",
        punc_model="ct-punc-c",
        punc_model_revision="v2.0.4",
        device=DEVICE,
        disable_update=True,
    )

    res = model.generate(
        input=wav_path,
        batch_size_s=300,
        hotword=HOTWORD,
    )

    if not res:
        raise RuntimeError("FunASR returned no recognition result.")

    result = res[0]
    text = result.get("text", "")
    timestamps = result.get("timestamp", [])

    txt_path = base + ".txt"
    srt_path = base + ".srt"
    vtt_path = base + ".vtt"

    write_txt(text, txt_path)

    if timestamps:
        items = split_text_by_timestamp(text, timestamps)
        subtitles = merge_chars_to_subtitles(
            items,
            max_chars=MAX_CHARS,
            max_duration=MAX_DURATION,
        )
        write_srt(subtitles, srt_path)
        write_vtt(subtitles, vtt_path)

        print("Recognition finished:")
        print(srt_path)
        print(vtt_path)
        print(txt_path)
    else:
        print("Recognition finished, but the result has no timestamp; only the txt file was written:")
        print(txt_path)


if __name__ == "__main__":
    main()

time python asr.py
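
Once the script has run, the generated .srt file can be given a quick sanity check before it goes into the editor. The sketch below assumes the layout the script writes (index line, time line, text, blank line); the names `srt_time_to_ms` and `check_srt` are made up for illustration.

```python
import re

TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_ms(stamp):
    """Convert an SRT time such as 00:01:02,345 to milliseconds."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT time: {stamp!r}")
    h, m, s, ms = (int(part) for part in match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def check_srt(content):
    """Return a list of problems found in SRT text (empty if none)."""
    problems = []
    prev_start = 0
    blocks = [b for b in content.strip().split("\n\n") if b.strip()]
    for number, block in enumerate(blocks, 1):
        lines = block.splitlines()
        if len(lines) < 3:
            problems.append(f"cue {number}: expected index, time line, and text")
            continue
        if lines[0].strip() != str(number):
            problems.append(f"cue {number}: index is {lines[0]!r}")
        try:
            start_text, end_text = lines[1].split(" --> ")
            start = srt_time_to_ms(start_text)
            end = srt_time_to_ms(end_text)
        except ValueError:
            problems.append(f"cue {number}: malformed time line")
            continue
        if end < start:
            problems.append(f"cue {number}: ends before it starts")
        if start < prev_start:
            problems.append(f"cue {number}: starts before the previous cue")
        prev_start = start
    return problems
```

A typical call would be `check_srt(open("a.srt", encoding="utf-8").read())`; an empty list means none of these basic problems were found.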

Fonts

Chinese fonts

English fonts

Montserrat - Google Fonts: headings (strong premium feel)
Roboto - Google Fonts: body text (very legible) ✅
Playfair Display - Google Fonts: research and paper covers
Oswald - Google Fonts: emphasized headings ✅

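Video export

Platform	Recommended resolution	Recommended frame rate	Jianying (剪映) export bitrate preset	Custom bitrate
Douyin	1080p-4K	30 fps	Recommended	20-35 Mbps
Xiaohongshu	1080p-4K	30 fps	Recommended	10-25 Mbps
Bilibili	2K-4K	60 fps	Higher	40-45 Mbps
YouTube	2K-4K	60 fps	Higher	40-45 Mbps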
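
These bitrates translate directly into file size, which is useful when a platform has an upload limit. The sketch below assumes the figures are megabits per second and that audio is encoded at 320 kbps; the helper name `estimated_size_mb` is made up for illustration, and real files differ slightly because of container overhead and variable bitrate.

```python
def estimated_size_mb(video_mbps, duration_s, audio_kbps=320):
    """Estimate the file size in megabytes (1 MB = 1,000,000 bytes)."""
    video_bits = video_mbps * 1_000_000 * duration_s
    audio_bits = audio_kbps * 1_000 * duration_s
    return (video_bits + audio_bits) / 8 / 1_000_000


# One minute at 20 Mbps video plus 320 kbps audio.
print(round(estimated_size_mb(20, 60), 1))  # 152.4
```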

Container format

Video: MP4 container with H.264.
Audio: AAC at 320 kbps with a 48,000 Hz sample rate.
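
For re-encoding outside the editor, the same settings can be expressed as an FFmpeg command. This is a sketch only: it assumes an FFmpeg build that includes the libx264 encoder, and both the helper name `build_export_cmd` and the default 20M video bitrate are illustrative choices rather than settings from the original workflow.

```python
def build_export_cmd(src, dst, video_bitrate="20M"):
    """Build an ffmpeg argument list for H.264 video and AAC audio in MP4."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-c:v", "libx264",      # H.264 video
        "-b:v", video_bitrate,  # target video bitrate
        "-c:a", "aac",          # AAC audio
        "-b:a", "320k",         # 320 kbps audio bitrate
        "-ar", "48000",         # 48,000 Hz sample rate
        dst,
    ]
```

The list can be passed to `subprocess.run(build_export_cmd("in.mov", "out.mp4"), check=True)`, in the same way the transcription script calls FFmpeg.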

Reference: 【影视飓风】剪辑全能必修课：从0基础到专业成片
