This n8n workflow demonstrates how we can use multimodal LLMs to parse and extract data from PDF documents in n8n.

In this particular scenario, we’re passing a candidate’s CV/resume to an AI which filters out unqualified applications. However, this sneaky candidate has added a hidden prompt to bypass our bot! Whatever will we do? Never fear, using AI Vision is one approach to solve this problem… read on!

How it works

Our candidate’s CV/resume is a PDF downloaded via Google Drive for this demonstration.

The PDF is then converted into a PNG image using a tool called Stirling PDF. Because the hidden prompt is written in a white font on a white background, it disappears entirely once the PDF is rasterised to an image.
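In the workflow this step is just an HTTP Request node pointed at Stirling PDF. As a rough sketch of the same call in TypeScript, the snippet below assumes Stirling PDF’s /api/v1/convert/pdf/img endpoint with fileInput and imageFormat form fields; check your instance’s Swagger docs for the exact names.

```typescript
// Minimal sketch of the PDF -> PNG conversion step via Stirling PDF.
// Endpoint path and form field names are assumptions based on Stirling PDF's
// REST API; verify them against your own instance.
import { readFile, writeFile } from "node:fs/promises";

const STIRLING_URL = "http://localhost:8080"; // ideally a self-hosted instance

async function pdfToPng(pdfPath: string, outPath: string): Promise<void> {
  const pdfBytes = await readFile(pdfPath);

  const form = new FormData();
  form.append("fileInput", new Blob([pdfBytes], { type: "application/pdf" }), "cv.pdf");
  form.append("imageFormat", "png");

  const res = await fetch(`${STIRLING_URL}/api/v1/convert/pdf/img`, {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`Stirling PDF returned ${res.status}`);

  // For a single-page CV the response body is the PNG itself;
  // multi-page PDFs may come back as an archive of images.
  await writeFile(outPath, Buffer.from(await res.arrayBuffer()));
}

pdfToPng("cv.pdf", "cv.png").catch(console.error);
```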

The image is then forwarded to a Basic LLM node to be processed by our multimodal model; in this example, we’ll use Google’s Gemini 1.5 Pro.

In the Basic LLM node, we’ll need to add a User Message with its type set to Binary. This allows us to send the image file directly in our request, as in the sketch below.
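Conceptually, what gets sent to the model is the screening prompt plus the PNG as inline base64 data. The sketch below shows an equivalent request against Google’s Gemini REST API; the prompt wording is illustrative, not the exact text used in the workflow.

```typescript
// Sketch of a multimodal request: the rasterised CV goes in as inline
// base64 image data alongside the screening instructions.
import { readFile } from "node:fs/promises";

const GEMINI_API_KEY = process.env.GEMINI_API_KEY!;
const MODEL = "gemini-1.5-pro";

async function screenCv(imagePath: string): Promise<string> {
  const imageBase64 = (await readFile(imagePath)).toString("base64");

  const body = {
    contents: [
      {
        parts: [
          {
            text:
              "You are screening job applications. Based only on this CV image, " +
              "answer whether the candidate looks qualified for the role.",
          },
          { inline_data: { mime_type: "image/png", data: imageBase64 } },
        ],
      },
    ],
  };

  const res = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:generateContent?key=${GEMINI_API_KEY}`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  );
  if (!res.ok) throw new Error(`Gemini returned ${res.status}`);

  const data = await res.json();
  // Any white-on-white "ignore previous instructions" text from the PDF
  // simply isn't present in the image, so it can't influence the model.
  return data.candidates?.[0]?.content?.parts?.[0]?.text ?? "";
}

screenCv("cv.png").then(console.log).catch(console.error);
```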

Because the hidden text does not survive the conversion to an image, the LLM is immune to the hidden prompt and its response is as expected.

Requirements

– Google Gemini API key. Alternatively, GPT-4 also works for this use case.

– Stirling PDF or another service which can convert PDFs into images. Note that this example uses a public API; for data privacy, it is recommended that you self-host and use a private instance of Stirling PDF instead.

Customising the workflow

Swap out the manual trigger for another trigger such as a webhook to integrate into your existing services.

This example demonstrates a validation use case, i.e., “does the candidate look qualified?”. You could additionally extract data points such as years of experience, previous companies, and so on, as in the sketch below.
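A hypothetical extension of the prompt for structured extraction might look like the following; the field names are illustrative only, and the request body follows the same Gemini REST format as the earlier sketch.

```typescript
// Hypothetical extension: ask the model for structured data points as JSON
// instead of a yes/no verdict. Field names below are illustrative.
function buildExtractionRequest(imageBase64: string) {
  return {
    contents: [
      {
        parts: [
          {
            text:
              "From this CV image, return a JSON object with the fields: " +
              '"qualified" (boolean), "years_of_experience" (number), ' +
              '"previous_companies" (string[]) and "key_skills" (string[]).',
          },
          { inline_data: { mime_type: "image/png", data: imageBase64 } },
        ],
      },
    ],
    // Request raw JSON so downstream n8n nodes can parse the output directly.
    generationConfig: { response_mime_type: "application/json" },
  };
}
```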