Recently, Gimmal interviewed Reid Smith, co-founder and CEO at i2k Connect. The mission of i2k Connect is to revolutionize information discovery using its novel artificial intelligence (AI) technology, informed by industry knowledge, to transform unstructured documents into structured data.
This post comes from Alexander Goerke, CEO and Founder at Skilja. The original article can be seen here.

Automatic, context-based classification for mailrooms has proven to generate significant ROI and to accelerate processes in the last few years. But we have also seen failures and disappointments. I have managed and monitored many of these projects in the past and would like to share 10 golden rules derived from my experience to make a mailroom classification process successful.

1. Plan enough time to prepare: Changing how business processes originate is a major organizational change for the company. Although work in the mailroom is often considered of minor importance, this is the area where everything starts. Any error here has significant effects on quality of service and response time. Any improvement will ensure that responses to requests are faster and customer satisfaction is maintained. So don’t begin with implementation right away. Forget about technology for a moment, put aside tools and spend time analyzing existing processes up front. Define clearly which of them you can automate and what would be the best way to approach this task. Document the findings and have them signed off by the customer.

2. Involve stakeholders: Classification drives and initiates business processes. In each organization you find existing stakeholders and subject matter experts (SMEs) who are familiar with the processes and have done manual sorting for years. They know the documents that arrive, and they know the explicit rules but also a lot of implicit procedures. Identify the SMEs and invite them to the team that defines the classification scheme. You can learn a lot from them. Create incentives for their active participation so you get access to the hidden knowledge they need to share with you for you to understand their business. Often a simple, straightforward appreciation of the work they have been doing until now is enough to get them involved.

3. Define goals: Clearly define goals and have them signed off by the team. Set expectations and explain to the team what classification can achieve, and what it can’t! Very often clients have unrealistic expectations of the performance of the new system and, at the same time, wrong assumptions about the manual process. Depending on the kind of documents, manual classification itself produces as much as 5% errors and misclassifications. It makes sense to hold a general educational session about classification technologies and the preconditions for successful classification. If clients are introduced (at a high level) to the technological foundation, they will understand better that even an automation rate of 70% with an error rate of 3% can be a big success, and that 99% is not realistic. Make sure that everybody understands that quality can only be measured statistically and that it makes no sense to focus on single documents that might have been misclassified.

4. Use good data sets: Get a large and representative set of documents from several weeks to account for changes in content by weekday. Typically 1,000 to 2,000 documents need to be reviewed. Work with the SME team to sort the documents manually and make sure you understand why they sort them as they do. Often the reasons they give for sorting contain valuable hints about exceptions that need to be coded as rules and cannot be trained automatically. Create 3 data sets from these documents: a training or development set, a test set and a reference set, and tag them with the correct classes. The test set will be used continuously to measure the effects of training and rules during development. The reference set will be touched only once, for the final quality measurement that leads to sign-off.

5. Use clean data: Make sure that you clean the documents before using them for classification. Documents contain “noise” like footers, general terms and disclaimers that are not relevant for the content. These need to be removed by prior analysis (e.g. font size) as they will certainly mislead the classifier. For e-mails, make sure to remove the text of earlier messages in the thread (which can be identified by indents), as it often cascades over many responses.

6. Start small: Do not attempt to solve the complete categorization problem at once. Focus on the main categories and start with them. In the beginning use the 10 major classes (the classes with the highest number of documents) and create a working scheme. Then expand by adding more classes in steps of 10 or 20 while continuously measuring quality.

7. Use hierarchy: If the classification software supports hierarchies (it should!), make intelligent use of them. Main classes can be easily identified, and subsequent classification can further break them down into the desired target classes corresponding to business processes. As the differences between documents become smaller and smaller the closer the topics are, it is easier to handle these distinctions in a hierarchy. Together with rule 6, hierarchy provides an easy and straightforward way to reduce complexity and build the scheme step by step.

8. Run continuous tests: Make sure that you test after each change. Quality can be assured only if you know what you are doing. Running automated tests after each change allows you to better understand the reasons for possible deterioration. Searching for these reasons later might prove very difficult.

9. Ramp up production: Don’t attempt to process the complete volume from day one when you start production. Rather, start with a fraction of the volume (10%) or, if possible, run automated classification in parallel for some days. This allows the business users to familiarize themselves with the system and allows you to correct errors that show up only in production. Ramp up the volume step by step until you process the complete volume after a few weeks.

10. Monitor in production: Measure and monitor quality in production. To achieve this you need statistics that show how many documents had to be classified manually and for how many the class was changed by users. Deterioration starts on day one. This is not necessarily the fault of the classification software but is more often due to slow but steady changes in the document content over time. By monitoring the system, you can tune it during production so that it stays up to date. If you are using a system with automated learning, monitoring is essential to find out whether the quality really goes up.

I hope you find these practical tips useful for your projects. Please leave a comment below to share your thoughts. If you want to learn more about auto-classification, please check our regular blog on www.skilja.com
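
A footnote to the “Monitor in production” rule: the two statistics it asks for can be computed from a very simple production log. The sketch below is only an illustration. The record fields (`auto_class`, `final_class`) and the function name `monitoring_stats` are made up for this example and do not come from any particular classification product. It assumes that a document the system could not classify has `auto_class` set to `None`, and that a user correction shows up as a `final_class` that differs from `auto_class`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DocRecord:
    """One processed document (hypothetical log record)."""
    auto_class: Optional[str]  # class assigned by the system, None if sorted manually
    final_class: str           # class after any user review


def monitoring_stats(records: Sequence[DocRecord]) -> dict:
    """Return the automation rate and the user-correction rate for a batch."""
    total = len(records)
    if total == 0:
        raise ValueError("no records to evaluate")
    automated = [r for r in records if r.auto_class is not None]
    corrected = [r for r in automated if r.auto_class != r.final_class]
    return {
        "total": total,
        # Share of all documents the system classified without manual sorting.
        "automation_rate": len(automated) / total,
        # Share of automatically classified documents whose class users changed.
        "correction_rate": len(corrected) / len(automated) if automated else 0.0,
    }


# Example batch: 10 documents, 7 classified automatically, 1 of those corrected.
batch = (
    [DocRecord("invoice", "invoice")] * 4
    + [DocRecord("claim", "claim")] * 2
    + [DocRecord("claim", "complaint")]
    + [DocRecord(None, "contract")] * 3
)
stats = monitoring_stats(batch)
print(f"automation rate: {stats['automation_rate']:.0%}")
print(f"correction rate: {stats['correction_rate']:.1%}")
```

Tracked week by week, a falling automation rate or a rising correction rate is an early sign of the slow drift in document content described above. Note that the correction rate only counts errors that users actually noticed and fixed, so it is a lower bound on the true error rate.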