Part 1: Foundations and Background
Let’s face it, traditional document processing can be a real hassle. Imagine dealing with piles of invoices, receipts, and forms, all requiring manual entry, repetitive checks, and endless adjustments whenever something new pops up. It’s tedious, error-prone, and a big drain on resources.
Enter Intelligent Document Processing (IDP): a solution powered by artificial intelligence that is revolutionizing how businesses handle documents. Instead of manual headaches, IDP automates the extraction, sorting, and analysis of key information, saving time and significantly reducing errors.
But here’s the catch: traditional IDP solutions often stumble when documents become complex or deviate from predefined templates. They’re rigid and slow to adapt. That’s where fine-tuning large language models (LLMs) changes the game. LLMs excel at understanding context and spotting patterns, even in unstructured or messy documents. And with Quantized Low-Rank Adaptation (QLoRA), we can harness this power efficiently, even without access to heavy-duty hardware.
In this two-part blog series, we’ll dive into how fine-tuning an LLM with QLoRA helps us tackle document processing challenges head-on, making the whole process smoother, smarter, and far more scalable. In this first part, we’ll cover the foundational concepts and background information you need to understand the technology.
What are Large Language Models (LLMs)?
Imagine teaching (training) an AI system by feeding it massive amounts of text: books, websites, conversations, and more. Large Language Models (LLMs) are precisely these AI systems. They’re built on transformer-based architectures, allowing them to grasp context, spot complex patterns, and generate impressively human-like text. Modern LLMs have revolutionized tasks ranging from simple Q&A to generating code and automating language-driven processes.
Today’s state-of-the-art models operate at a staggering scale, with models like GPT and Claude containing hundreds of billions of parameters. These parameters are essentially the model’s “knowledge,” stored as numerical weights in vast matrices. Training such behemoths requires immense computational resources: clusters of thousands of specialized GPUs working in parallel for weeks or even months, consuming megawatts of power and costing millions of dollars per training run.
What makes these models so resource-intensive is the sheer number of weight matrices W distributed throughout their architecture. Each of these matrices represents learned patterns from the training data, collectively forming the model’s “intelligence.” When we fine-tune an LLM for specific tasks, we traditionally update all these weight values incrementally, a process that inherits the original model’s enormous resource requirements.
Introduction to QLoRA
Fine-tuning such powerful LLMs can quickly become prohibitively expensive, especially as model sizes balloon into billions of parameters. Quantized Low-Rank Adaptation, or QLoRA, steps in to alleviate these challenges by addressing two critical questions: How can we reduce the memory footprint of these massive weight matrices? And how can we efficiently update only what’s necessary?
Low-Rank Adaptation (LoRA)
Let’s start with Low-Rank Adaptation (LoRA). Traditional fine-tuning updates the entire weight matrix \( W \) for each layer in the network. LoRA instead introduces a clever parameterization trick: instead of directly modifying \( W \), we add a decomposed low-rank update:

\[ W' = W + \Delta W = W + BA \]

Where \( B \) and \( A \) are much smaller matrices with dimensions chosen so that their product has the same shape as \( W \). If \( W \) is an \( m \times n \) matrix, then \( B \) is \( m \times r \) and \( A \) is \( r \times n \), with the rank \( r \) typically much smaller than both \( m \) and \( n \) (often 8, 16, or 32).
This reduces the number of trainable parameters from m × n to just r × (m + n) — often a reduction of 99% or more.
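To make the savings concrete, here is a minimal NumPy sketch of the low-rank update; the layer dimensions and rank are illustrative, not taken from any particular model:

```python
import numpy as np

# Illustrative dimensions: a 4096 x 4096 projection layer with rank r = 16.
m, n, r = 4096, 4096, 16

W = np.random.randn(m, n).astype(np.float32)  # frozen pre-trained weights
B = np.zeros((m, r), dtype=np.float32)        # LoRA factor, initialized to zero
A = np.random.randn(r, n).astype(np.float32)  # LoRA factor, random init

# The effective weight during fine-tuning: only A and B receive gradients.
W_effective = W + B @ A

full_params = m * n
lora_params = r * (m + n)
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"reduction: {100 * (1 - lora_params / full_params):.1f}%")
```

For these dimensions, the LoRA factors hold roughly 131 thousand parameters versus nearly 17 million in the full matrix, a reduction of over 99%.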
Quantization
Quantization tackles the problem from another angle by reducing the precision of weight representations. In standard neural networks, weights are typically stored as 32-bit floating-point numbers (FP32). Quantization compresses this representation to lower-precision formats like 8-bit integers, 4-bit integers, or even binary values. For example, converting from FP32 to 4-bit integers with absmax scaling:

\[ W_{\text{quant}} = \mathrm{round}\!\left( \frac{2^{b-1} - 1}{\max |W|} \cdot W \right) \]

Where \( b \) is the target bit-width (4 in this case). This compression reduces memory requirements by a factor of 8 compared to FP32, at the cost of some precision.
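As a toy illustration of this formula, the sketch below applies symmetric absmax quantization and measures the round-trip error. Note that QLoRA itself uses a more sophisticated 4-bit NormalFloat (NF4) data type with double quantization; this sketch only shows the basic idea:

```python
import numpy as np

def absmax_quantize(W: np.ndarray, b: int = 4):
    """Symmetric absmax quantization of FP32 weights to b-bit integers."""
    scale = (2 ** (b - 1) - 1) / np.max(np.abs(W))  # map max |weight| to int range
    W_q = np.round(W * scale).astype(np.int8)       # b-bit values stored in int8
    return W_q, scale

def dequantize(W_q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately recover the original FP32 weights."""
    return W_q.astype(np.float32) / scale

W = np.random.randn(4, 4).astype(np.float32)
W_q, scale = absmax_quantize(W, b=4)
W_hat = dequantize(W_q, scale)
print("max round-trip error:", np.max(np.abs(W - W_hat)))  # small, but nonzero
```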
QLoRA Framework
QLoRA cleverly combines these two techniques into a unified framework. The process works as follows (a minimal configuration sketch follows the list):
- Compress Model Size: The pre-trained model weights are compressed to a low-precision format (typically 4-bit), significantly reducing memory requirements.
- Use Lightweight Adapters: Instead of updating all the model’s weights, small, efficient adapters (the A and B matrices) are added alongside the frozen layers.
- Keep Original Weights Unchanged: Only these lightweight LoRA parameters are updated during training, while the original compressed weights remain unchanged.
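To see how these steps look in practice, here is a hedged sketch using the Hugging Face transformers, bitsandbytes, and peft libraries. Our actual training setup, covered in Part 2, uses Unsloth; the base checkpoint name and hyperparameters below are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Illustrative base model; swap in whichever checkpoint you fine-tune.
model_name = "meta-llama/Llama-2-7b-hf"

# Step 1: load the frozen base weights compressed to 4-bit NF4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config
)

# Step 2: attach small trainable LoRA adapters (the A and B matrices).
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Step 3: only the adapters are trainable; the 4-bit base stays frozen.
model.print_trainable_parameters()  # typically well under 1% of total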
This approach offers several benefits:
- Memory usage is reduced significantly compared to traditional fine-tuning methods.
- The quality of the fine-tuned model remains comparable to full fine-tuning.
- The fine-tuned model can be deployed with minimal additional effort by merging the LoRA parameters back into the original weights (see the sketch after this list).
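For that last deployment step, the peft library offers merge_and_unload(), which folds the adapters back into the base weights. The output path below is illustrative, and note that merging into a 4-bit base may require dequantizing to higher precision first, depending on the library version:

```python
# Fold the trained adapters back into the base weights for deployment.
# After merging, the model behaves like an ordinary checkpoint with no
# adapter overhead at inference time.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("idp-extractor-merged")  # illustrative path
```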
Impact of QLoRA
QLoRA democratizes LLM customization, enabling researchers and developers with modest computational resources to adapt state-of-the-art models for specialized applications, a capability previously restricted to well-funded tech giants with access to vast computing infrastructure.
CROZ IDP Benchmark Dataset
To properly evaluate our approach, we created a comprehensive benchmark by combining several publicly available datasets. We standardized their formats, shuffled the data, and selected 5,000 representative samples. We call this the CROZ IDP Benchmark dataset. The source datasets include:
• OmniAI OCR Benchmark: A go-to dataset for evaluating key information extraction.
• CORD v2: Ideal for extracting information from receipts.
• FUNSD: Great for structured data extraction from forms.
• SROIE: Focuses on extracting data from receipts.
• Receipt VLM Extraction: Specialized for receipt processing using visual language models.
• Merit: Offers documents with a static structure for consistent extraction.
• Donut Data: Valuable for standardizing invoice extraction.
Combining and Preprocessing
Each dataset was systematically unified into a consistent format, including:
• OCR outputs in plain text (extracted from provided images).
• Ground truth JSON annotations specifying exact extraction fields like document IDs, dates, amounts, and customer/vendor names.
Example of a Ground Truth JSON:
{ "gt_parse": { "company": "SYARIKAT PERNIAGAAN GIN KEE", "date": "02/01/2018", "address": "NO 290, JALAN AIR PANAS, SETAPAK, 53200, KUALA LUMPUR.", "total": "93.07" } }
The unified dataset was then split into training (4,000 instances), validation (500 instances), and test (500 instances) sets, corresponding to an 80/10/10 split for thorough evaluation.
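A sketch of the shuffle-and-split step, assuming the unified samples live in a JSON Lines file; the file name and seed are illustrative, not our actual pipeline:

```python
import json
import random

# Load the unified samples: each entry pairs OCR text with ground-truth JSON.
with open("croz_idp_benchmark.jsonl") as f:
    samples = [json.loads(line) for line in f]

random.seed(42)   # fixed seed so the split is reproducible
random.shuffle(samples)

n = len(samples)  # 5,000 in our case
train = samples[: int(0.8 * n)]               # 4,000 instances
val   = samples[int(0.8 * n): int(0.9 * n)]   # 500 instances
test  = samples[int(0.9 * n):]                # 500 instances
```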
Adapting to ChatML Format
To align with our training approach using the Unsloth bootstrap notebook, we transformed each dataset entry into the ChatML format. This conversion creates a structured conversation that maps directly onto how LLMs learn through next-token prediction. Each data point follows a three-part structure (a small conversion sketch follows the list):
- A system prompt defining the extraction task
- A user message containing OCR text and an empty JSON schema
- An assistant response with the correctly filled JSON structure
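A minimal sketch of this conversion; the function and variable names are our own for illustration, but the system prompt is the one used in our training data:

```python
SYSTEM_PROMPT = (
    "You will be provided a piece of text and a schema of extraction in JSON "
    "format. Your task is to extract the given information and return a valid "
    "JSON object."
)

def to_chatml(ocr_text: str, schema_json: str, gt_json: str) -> str:
    """Assemble one ChatML training instance from a unified dataset entry."""
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\nText: {ocr_text}\nSchema: {schema_json}<|im_end|>\n"
        f"<|im_start|>assistant\n{gt_json}<|im_end|>"
    )
```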
Example of a ChatML Training Instance:
<|im_start|>system
You will be provided a piece of text and a schema of extraction in JSON format. Your task is to extract the given information and return a valid JSON object.
<|im_end|>
<|im_start|>user
Text: 3 18 01 O14 SYARIKAT PERNIAGAAN GIN KEE (81109-A) NO 290, JALAN AIR PANAS, SETAPAK, 53200, KUALA LUMPUR. TEL : 03-40210276 GST ID : 000750673920 SIMPLIFIED TAX INVOICE (...) CASH _ Doc No * CS00012507 Date: 02/01/2018 Cashier USER Time: 16:58:00 Salesperson Ref. : ttem ss Qty, S/Price Amount Tax_ 1811 1 13.57 1357 SR 82X3 1042 4 18.55 7420 SR 7’ ¥ 35# CORRUGATED ROOFING SHEET 1921 1 5.30 §30 SR NAIL (PER/PACK)- RM5 ~~ JotalQty 6" 9307 Total Sales (Excluding GST) : 87.80 Discount : 0.00 TotalGST . 5.27 Rounding : 0.00 Total Sales (Inclusive of GST) : 93.07 CASH : 93.07 Change : 0.00 [esr SUMMARY OO Tax Code % Amt(RM) Tax(RM) SR 6 87.80 5.27 Total : 87.80 5.27 GOONS SOLD ARE NOT RETURNABLE, THANK YOU d
Schema: { "gt_parse": { "company": "", "date": "", "address": "", "total": "" } }
<|im_end|>
<|im_start|>assistant
{ "gt_parse": { "company": "SYARIKAT PERNIAGAAN GIN KEE", "date": "02/01/2018", "address": "NO 290, JALAN AIR PANAS, SETAPAK, 53200, KUALA LUMPUR.", "total": "93.07" } }
<|im_end|>
During training, the model learns by predicting the next token in the sequence based on all the preceding tokens. By presenting the model with the complete ChatML structure for each example, it learns exactly how the assistant’s response should follow the system and user messages. In essence, you show the model the beginning of a conversation turn (system and user) and teach it to complete that turn correctly (the assistant’s response) through this next-token prediction process.
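The sketch below illustrates the shift that implements next-token prediction: the logits at position i are scored against the token at position i + 1. The random tensors stand in for a real model’s output and a real tokenized ChatML sequence:

```python
import torch
import torch.nn.functional as F

# Toy illustration: given tokens t_1..t_{k-1}, the model is trained to
# assign high probability to t_k at every position in the sequence.
vocab_size, seq_len = 32000, 8
logits = torch.randn(1, seq_len, vocab_size)        # stand-in model output
tokens = torch.randint(0, vocab_size, (1, seq_len))  # stand-in token ids

# Shift by one: position i predicts token i + 1.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
```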
This builds the crucial ability to generate the expected structured output. At evaluation time, you provide only the system and user messages, and the trained model applies its learned predictive ability to generate the assistant’s response (in our case, a JSON object). This is precisely how you interact with your favorite chatbot!
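In code, evaluation looks roughly like the following, reusing names from the sketches above (model and tokenizer are the fine-tuned artifacts; ocr_text and schema_json come from a test sample):

```python
# Supply only the system and user turns; the model completes the assistant
# turn. apply_chat_template and generate are standard transformers calls.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Text: {ocr_text}\nSchema: {schema_json}"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(inputs, max_new_tokens=256)

# Decode only the newly generated tokens: the predicted JSON.
prediction = tokenizer.decode(
    output_ids[0, inputs.shape[-1]:], skip_special_tokens=True
)
```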
To Be Continued…
In this first part of our two-part blog series, we’ve covered the foundational concepts behind using Large Language Models with QLoRA for document processing. We’ve explored what makes LLMs so powerful, how QLoRA makes them more accessible, and how we prepared our benchmark dataset for effective fine-tuning.
In Part 2, we’ll dive into the technical implementation details, showing you exactly how to set up and train your model using Unsloth. We’ll also share our impressive results and performance gains, discuss practical business applications, and provide valuable insights from our experience.
Stay tuned for the second part where we’ll complete this journey through supercharging document processing with LLMs and QLoRA!