Client Overview
The client has over 15,000 employees and is active across 150 countries, offering expertise in Health, Tax and Accounting, Governance, Risk and Compliance.
Business Need
The client's existing process for classifying 10-K forms was manual, error-prone and not scalable. With 10M documents and 36 target categories, they wanted an intelligent classification model.
Key Features
- Parser: Apache Tika plus custom parsing, taking XML files as input and producing text files as output
- Evaluated DL4J, Naïve Bayes and TensorFlow as classifiers, running each model against the test set
- Reviewer: presents classifier results, including audit logs and documents parsed, and supports reviewing outliers
- Custom Reviewer to capture the accuracy of each classification iteration
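
The parse, classify and review flow above can be illustrated with a minimal sketch. This is not the project's code: Python's standard-library XML parser stands in for the Apache Tika plus custom parser, the Naïve Bayes classifier is a small hand-written multinomial variant, and the sample documents and the category labels "financials" and "risk" are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def xml_to_text(xml_string):
    """Flatten an XML document into plain text (stand-in for the Tika-based parser)."""
    root = ET.fromstring(xml_string)
    return " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(doc):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the expected one."""
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


# Hypothetical toy data; the real project used 10-K filings and 36 categories.
train_xml = [
    ("<doc><p>annual revenue and net income increased</p></doc>", "financials"),
    ("<doc><p>revenue growth and operating income</p></doc>", "financials"),
    ("<doc><p>pending litigation and regulatory risk</p></doc>", "risk"),
    ("<doc><p>risk factors include litigation exposure</p></doc>", "risk"),
]
train_docs = [xml_to_text(x) for x, _ in train_xml]
train_labels = [y for _, y in train_xml]
model = NaiveBayes().fit(train_docs, train_labels)

test_docs = [
    xml_to_text("<doc><p>net income and revenue</p></doc>"),
    xml_to_text("<doc><p>litigation risk exposure</p></doc>"),
]
test_labels = ["financials", "risk"]
print("accuracy:", accuracy(model, test_docs, test_labels))
```

The `accuracy` helper plays the role the Reviewer has in the bullets above: rerunning it after each change to the parser or the feature set gives a per-iteration figure that can be logged and compared.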
Results
- An additional ~1,800 documents were included in the POC, on top of the original ~2,000, to validate the accuracy.
- Naïve Bayes provided the highest level of accuracy (~95%). Accuracy can be further enhanced by including an external feature set.
- Scalability can be achieved by using or building Big Data frameworks for distributed computing.