In our research group, expense reimbursement inevitably means creating inbound and outbound documents from invoices. That is manageable with a few items, but it becomes really tedious with many. So, while I was still handling reimbursements, I built a small tool that generates inbound and outbound documents easily, as shown in the image below:
Although it reduced the mental burden, it still required typing in invoice numbers, invoice codes, invoice dates, and other information by hand. Now that I'm no longer involved in reimbursement work, it suddenly occurred to me how great it would be to upload a file and get all the invoice information directly. So I just did it. The original project was written in JS/TS, but this time I switched to Python; after all, life is short, and I prefer Python.
The main code implementation for extracting invoice information is as follows, primarily relying on the pdfplumber library and regular expressions:
import re
from typing import Any, Dict, List, Optional

import pdfplumber


class InvoiceExtractor:
    def _invoice_pdf2txt(self, pdf_path: str) -> Optional[str]:
        """
        Extract text from a PDF file using pdfplumber.
        :param pdf_path: Path to the PDF file.
        :return: Extracted text as a string, or None if extraction fails.
        """
        try:
            with pdfplumber.open(pdf_path) as pdf:
                # Call extract_text() once per page and skip pages with no text.
                page_texts = (page.extract_text() for page in pdf.pages)
                return '\n'.join(text for text in page_texts if text)
        except Exception:
            # Any failure (missing file, malformed PDF) is reported as None.
            return None

    def _extract_invoice_product_content(self, content: str) -> str:
        """
        Extract product-related content from the invoice text.
        :param content: Complete text of the invoice.
        :return: Extracted product-related content as a string.
        """
        lines = content.splitlines()
        start_pattern = re.compile(r"^(Goods or Taxable Services|Project Name)")
        end_pattern = re.compile(r"^Total Price and Tax")
        # Index of the first line matching each marker, or None if absent.
        start_index = next((i for i, line in enumerate(lines) if start_pattern.match(line)), None)
        end_index = next((i for i, line in enumerate(lines) if end_pattern.match(line)), None)
        if start_index is not None and end_index is not None:
            extracted_lines = lines[start_index:end_index + 1]
            return '\n'.join(extracted_lines).strip()
        return "No matching content found"

    def construct_invoice_product_data(self, raw_text: str) -> List[Dict[str, Any]]:
        """
        Process the extracted text to construct a list of invoice product data.
        :param raw_text: Extracted raw text.
        :return: List of product data, each product as a dictionary.
        """
        blocks = re.split(r'(?=Goods or Taxable Services|Project Name)', raw_text.strip())
        records: List[Dict[str, Any]] = []
        for block in blocks:
            lines = [line.strip() for line in block.splitlines() if line.strip()]
            if not lines:
                continue
            current_record = ""
            # lines[0] is the table header, so start from the second line.
            for line in lines[1:]:
                # Skip summary rows; this also covers "Total Price and Tax".
                if line.startswith("Total"):
                    continue
                if line.startswith("*"):
                    # A leading "*" starts a new item; flush the previous one.
                    if current_record:
                        self._process_record(current_record, records)
                    current_record = line
                elif " " in current_record:
                    # Continuation of a wrapped product name: insert it
                    # right before the first space of the current record.
                    first_space_index = current_record.index(" ")
                    current_record = (
                        current_record[:first_space_index]
                        + line
                        + current_record[first_space_index:]
                    )
            if current_record:
                self._process_record(current_record, records)
        return records

    def _process_record(self, record: str, records: List[Dict[str, Any]]) -> None:
        """
        Process a single record and add it to the record list.
        :param record: A single record string.
        :param records: Record list.
        """
        # Split from the right so that spaces inside the product name survive.
        parts = record.rsplit(maxsplit=7)
        if len(parts) == 8:
            try:
                records.append({
                    "product_name": parts[0].strip(),
                    "specification": parts[1].strip(),
                    "unit": parts[2].strip(),
                    "quantity": parts[3].strip(),
                    "unit_price": float(parts[4].strip()),
                    "amount": float(parts[5].strip()),
                    "tax_rate": parts[6].strip(),
                    "tax_amount": float(parts[7].strip()),
                })
            except ValueError as e:
                print(f"Failed to parse record: {record}, Error: {e}")
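The row-parsing idea in `_process_record` and the wrapped-name handling in `construct_invoice_product_data` can be tried in isolation. The snippet below is only a minimal sketch: the sample rows and the helper names `split_row` and `merge_wrapped_name` are made up for illustration, and it assumes every item row ends with exactly seven space-separated fields after the product name.

```python
from typing import Dict

# Field order assumed from the dictionary keys built in _process_record.
FIELDS = [
    "product_name", "specification", "unit", "quantity",
    "unit_price", "amount", "tax_rate", "tax_amount",
]

# Hypothetical item row; real invoices will differ.
SAMPLE_ROW = "*Reagent*Anhydrous ethanol 500ml bottle 2 25.00 50.00 13% 6.50"


def split_row(row: str) -> Dict[str, str]:
    """Split a row from the right so spaces in the product name are kept."""
    parts = row.rsplit(maxsplit=7)
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}: {row!r}")
    return dict(zip(FIELDS, parts))


def merge_wrapped_name(record: str, continuation: str) -> str:
    """Insert a wrapped name fragment right before the record's first space."""
    head, sep, tail = record.partition(" ")
    if not sep:
        # No space yet, so there is nowhere to insert; leave the record as is.
        return record
    return head + continuation + sep + tail


if __name__ == "__main__":
    print(split_row(SAMPLE_ROW))
    print(merge_wrapped_name("*Reagent*Ethanol 500ml bottle", "(anhydrous)"))
```

Splitting from the right works because the seven trailing columns never contain spaces, while the product name might. The merge step relies on the same assumption as the original code: the wrapped fragment belongs to the name, which ends at the first space.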
The end result is a list of dictionaries, one per line item, each holding the product name, specification, unit, quantity, unit price, amount, tax rate, and tax amount. Building on this script, I combined it with FastAPI and Vue 3 to create an application where users drag and drop an invoice to get its information and export inbound and outbound documents:
Of course, I am no longer responsible for reimbursement work, but the tool is there for my junior colleagues. Whether or not they use it, I have built it.