Data quality headaches & progress

Extracting text from PDFs is, in fact, AI-complete, or at least comparably complex. See http://filingdb.com/b/pdf-text-extraction for a comprehensive list of the issues. Apart from the fact that no one has yet managed to extract knowledge from text in bulk, there are two main problems in unsupervised learning from large English text corpora: the availability of relevant, useful data and the quality of […]