HAIDi blog
Publishing our work.
2017-03-10
Reading documents with Aspose
Our previous document transformation solution was implemented in .NET on Windows machines and depended on MS Outlook and MS Office products. To automate this very important step in the pipeline, we wanted to rebuild it as an asynchronous, highly scalable microservice without heavy dependencies.
Solution
The Aspose SDK gave us a Java API that runs without any dependency on a Windows environment. We can now build lightweight text extraction and file conversion microservices, packaged as Docker images. These microservices run in a highly parallel and scalable way in our Apache Mesos environment. It is also very simple to set up a Docker environment and deploy them in our customers' environments.
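As a rough illustration of the asynchronous, parallel design described above, the following is a minimal sketch and not our production code. The names TextExtractor and ExtractionService are hypothetical, and the extractor interface is only a placeholder for the point where a real service would call the document library. No Aspose API calls are shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical abstraction; a real implementation would delegate to the document library. */
interface TextExtractor {
    String extract(String fileName, byte[] content) throws Exception;
}

/** Runs extraction jobs in parallel on a fixed-size worker pool. */
class ExtractionService implements AutoCloseable {
    private final ExecutorService pool;
    private final TextExtractor extractor;

    ExtractionService(TextExtractor extractor, int workers) {
        this.extractor = extractor;
        this.pool = Executors.newFixedThreadPool(workers);
    }

    /** Submits one document and returns immediately with a future for its text. */
    CompletableFuture<String> submit(String fileName, byte[] content) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return extractor.extract(fileName, content);
            } catch (Exception e) {
                // Surface extraction failures through the future instead of losing them.
                throw new CompletionException(e);
            }
        }, pool);
    }

    /** Submits a whole batch first, then waits for every result, keyed by file name. */
    Map<String, String> extractAll(Map<String, byte[]> documents) {
        Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
        documents.forEach((name, bytes) -> pending.put(name, submit(name, bytes)));

        Map<String, String> results = new LinkedHashMap<>();
        pending.forEach((name, future) -> results.put(name, future.join()));
        return results;
    }

    /** Stops accepting new jobs; already submitted jobs are allowed to finish. */
    @Override
    public void close() {
        pool.shutdown();
    }
}
```

Because the extractor sits behind an interface, the worker pool can be tested with a stub, and the same container image can be scaled out by simply running more instances.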
Experience
In the past we implemented basic office document conversion ourselves and tried several other free and paid solutions, but we found Aspose.Total for Java to be the clear winner. It offers a lot of functionality, is easy to use, has clear documentation, and does not depend on Windows or the native Office .NET API. It is easily scalable and embeddable, and it has a convenient license policy. We also use it to extract text and metadata from machine-readable PDF and Word files.
Next Steps
We plan to experiment with the OCR library and to try collaborating with Aspose on support for East European languages. In the coming months we are also going to implement document generation, and we look forward to using Aspose for this scenario too.
Summary
We see the Aspose library as a key driver of several core components in our document analysis pipeline. It is easy to recommend Aspose to anyone who needs to work with Microsoft Office and PDF documents, in scenarios ranging from text extraction to document conversion.