Recommended real-time video model: gemini-2.0-flash

Here is a recommendation for a multimodal AI model that can read your camera or screen in real time: Google Gemini 2.0 Flash. You can try it out here.


As far as I know, this should be the first model to support real-time video input. If its video API is opened to the public, the idea of video-chatting with an AI could actually be realized.

That's Google for you: with pockets that deep, even the free toys they hand developers are this good.


That said, it still has one issue: when it answers via voice output, it pronounces Chinese characters as Japanese. For example, it reads 大丈夫 as the Japanese daijoubu rather than the Mandarin dàzhàngfu. The issue doesn't occur when it replies in text.