banner
Riceneeder

Riceneeder

卜得山火贲之变艮卦,象曰:装饰既成,宜静宜止。2025下半年,不宜躁进,宜守正持中,沉淀与反思,将为日后之再发打下基石。
github
email

Extract product details from the PDF invoice

In the research group, when it comes to expense reimbursement, it is inevitable to create inbound and outbound documents based on invoices. It's manageable when there are few items, but it becomes really troublesome when there are many. So, while I was still involved in the reimbursement work, I created a small tool that can easily generate inbound and outbound documents, as shown in the image below:

jietuchurukuold

Although it reduced the mental burden, it still required manual input of invoice numbers, codes, invoice dates, and other information. Now that I'm no longer involved in reimbursement work, I suddenly thought how great it would be if I could directly upload a file to obtain all the invoice information. So I just did it. The initial project was done in js/ts, but this time I switched to Python; after all, life is short, and I prefer Python.

The main code implementation for extracting invoice information is as follows, primarily relying on the pdfplumber library and regular expressions:

import pdfplumber
import re
from typing import List, Dict, Optional

class InvoiceExtractor:
    def _invoice_pdf2txt(self, pdf_path: str) -> Optional[str]:
        """
        Extract text from a PDF file using pdfplumber.
        :param pdf_path: Path to the PDF file.
        :return: Extracted text as a string, returns None if extraction fails.
        """
        try:
            with pdfplumber.open(pdf_path) as pdf:
                text = '\n'.join(page.extract_text() for page in pdf.pages if page.extract_text())
            return text
        except Exception as e:
            #print(f"Error extracting text from {pdf_path}: {e}")
            return None

    def _extract_invoice_product_content(self, content: str) -> str:
        """
        Extract product-related content from the invoice text.
        :param content: Complete text of the invoice.
        :return: Extracted product-related content as a string.
        """
        lines = content.splitlines()
        start_pattern = re.compile(r"^(Goods or Taxable Services|Project Name)")
        end_pattern = re.compile(r"^Total Price and Tax")

        start_index = next((i for i, line in enumerate(lines) if start_pattern.match(line)), None)
        end_index = next((i for i, line in enumerate(lines) if end_pattern.match(line)), None)

        if start_index is not None and end_index is not None:
            extracted_lines = lines[start_index:end_index + 1]
            return '\n'.join(extracted_lines).strip()
        return "No matching content found"

    def construct_invoice_product_data(self, raw_text: str) -> List[Dict[str, str]]:
        """
        Process the extracted text to construct a list of invoice product data.
        :param raw_text: Extracted raw text.
        :return: List of product data, each product as a dictionary.
        """
        blocks = re.split(r'(?=Goods or Taxable Services|Project Name)', raw_text.strip())
        records = []

        for block in blocks:
            lines = [line.strip() for line in block.splitlines() if line.strip()]
            if not lines:
                continue

            current_record = ""
            for line in lines[1:]:
                if line.startswith("Total") or line.startswith("Total Price and Tax"):
                    continue

                if line.startswith("*"):
                    if current_record:
                        self._process_record(current_record, records)
                    current_record = line
                else:
                    if " " in current_record:
                        first_space_index = current_record.index(" ")
                        current_record = current_record[:first_space_index] + line + current_record[first_space_index:]

            if current_record:
                self._process_record(current_record, records)

        return records

    def _process_record(self, record: str, records: List[Dict[str, str]]):
        """
        Process a single record and add it to the record list.
        :param record: A single record string.
        :param records: Record list.
        """
        parts = record.rsplit(maxsplit=7)
        if len(parts) == 8:
            try:
                records.append({
                    "product_name": parts[0].strip(),
                    "specification": parts[1].strip(),
                    "unit": parts[2].strip(),
                    "quantity": parts[3].strip(),
                    "unit_price": float(parts[4].strip()),
                    "amount": float(parts[5].strip()),
                    "tax_rate": parts[6].strip(),
                    "tax_amount": float(parts[7].strip())
                })
            except ValueError as e:
                print(f"Failed to parse record: {record}, Error: {e}")
                pass

In the end, a dictionary will be obtained containing the invoice's product name, specification, unit, quantity, unit price, total price, tax rate, and tax amount. Following this script, combined with fastapi and vue3, I created an application that allows users to drag and drop to obtain invoice information and export inbound and outbound documents:

screenshot

Of course, I am no longer responsible for reimbursement work, but what I created benefits my junior colleagues, regardless of whether they use it or not; I have made it.

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.