Recommended real-time video model: gemini-2.0-flash

Here is a recommendation for a multimodal AI model that can read your camera or screen in real time: Google Gemini 2.0 Flash. You can try it out here.


As far as I know, this should be the first model to support real-time video input. If its video API is opened to the public, the idea of video-chatting with an AI could actually be realized.

That's Google for you: with pockets that deep, even the free toys they hand developers are this good.


That said, it still has one issue: when it answers via voice output, it pronounces Chinese characters as Japanese. For example, it reads 大丈夫 as the Japanese daijoubu rather than the Mandarin dàzhàngfu. The issue doesn't occur when it replies in text.