This n8n template extracts frames from a video and passes them to a multimodal LLM to generate a narration script. The script is then sent to OpenAI's text-to-speech (TTS) API to generate a voiceover clip.
This template was inspired by processing and narrating a video with GPT’s visual capabilities and the TTS API.
How it works:
1. The video is downloaded using the HTTP Request node.
2. A Python code node is used to extract the frames using OpenCV.
3. A loop node is used to batch the frames for the LLM to generate partial scripts.
4. All partial scripts are combined to form the full script, which is then sent to OpenAI to generate audio from it.
5. The finished voiceover clip is uploaded to Google Drive.
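As a rough illustration of steps 2 to 4, here is a minimal Python sketch of how frames might be sampled with OpenCV, batched, and the partial scripts joined. It is not the template's actual code node: the function names, the `max_frames` default, and the choice of base64-encoded JPEG output are all assumptions made for this example.

```python
import base64


def pick_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Return up to max_frames roughly evenly spaced frame indices."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    step = max(1, total_frames // max_frames)
    return list(range(0, total_frames, step))[:max_frames]


def extract_frames(video_path: str, max_frames: int = 90) -> list[str]:
    """Read a video with OpenCV and return base64-encoded JPEG frames."""
    import cv2  # imported lazily so the helpers below work without OpenCV

    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for index in pick_frame_indices(total, max_frames):
        capture.set(cv2.CAP_PROP_POS_FRAMES, index)  # seek to the frame
        ok, frame = capture.read()
        if not ok:
            break
        encoded_ok, buffer = cv2.imencode(".jpg", frame)
        if encoded_ok:
            frames.append(base64.b64encode(buffer.tobytes()).decode("ascii"))
    capture.release()
    return frames


def batch(items: list, size: int) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def combine_scripts(parts: list[str]) -> str:
    """Join non-empty partial scripts into one script, one part per line."""
    return "\n".join(p.strip() for p in parts if p.strip())
```

Sampling a fixed number of frames rather than every frame keeps the number of images sent to the LLM bounded, at the cost of possibly missing short scenes.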
Requirements:
– OpenAI for the LLM and text-to-speech
– Google Drive for storing the finished voiceover clip
– Ideally, a mid-range machine (16 GB RAM) for acceptable performance.
Customizing this workflow:
– For larger videos, consider splitting them into smaller clips for better performance.
– Alternatively, use a multimodal LLM that accepts video input directly, such as Google’s Gemini, so that frames do not need to be extracted first.
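For the first suggestion, one way to plan the split is to compute the time range of each clip up front. The helper below is a hypothetical sketch (the name `clip_ranges` and its parameters are assumptions, not part of the template); the resulting ranges could then be handed to whatever tool does the actual cutting.

```python
def clip_ranges(duration: float, clip_length: float) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole video."""
    duration = float(duration)
    if duration <= 0 or clip_length <= 0:
        return []
    ranges = []
    start = 0.0
    while start < duration:
        # The last clip may be shorter than clip_length.
        end = min(start + clip_length, duration)
        ranges.append((start, end))
        start = end
    return ranges
```

Each clip can then be run through the workflow separately and the resulting scripts concatenated in order.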