Tag Archives: Derek Willis

Why extracting data from PDFs is still a nightmare for data experts

“The biggest [drawback] is that they are probabilistic prediction machines and will get it wrong in ways that aren’t just ‘that’s the wrong word’,” Willis explains. “LLMs will sometimes skip a line in larger documents where the layout repeats itself, I’ve found, where OCR isn’t likely to do that.” AI researcher and data journalist Simon… Read More »