Document Management System using OpenKM on CentOS 6 Deployment Guide

There are many reasons why an organization may want to implement a Document Management System (DMS). These days, one company will often not do business with another unless the other party can demonstrate a certain level of control over the documents it creates.

An organization implements a DMS because it wants to ensure that people can trust the information they are looking at and know that it is the latest and truest version.

Where information is not current, disaster can occur. Imagine sending the wrong contract to a company for signature, or an outdated engineering design to a client.

You should know what a DMS can do for you and set realistic expectations. Be aware of the challenges that will surface, and be prepared to take steps to overcome them.

Set the correct expectations and get buy-in from your users and managers. Over time, people will begin to understand the important role the DMS plays in the organization.

1.1 Authority

The Philippine Linux Windows Users Group (PH-LWUG) developed this document to provide guidance on how to implement a DMS, specifically OpenKM, on CentOS 6.

PH-LWUG’s objective is to develop user guides on the effective implementation of Open Source products and, where possible, on integrating them with proprietary software for better results and greater flexibility.

This guideline has been prepared for the use of individuals and organizations, without warranty. It may be used on a voluntary basis and is not subject to copyright, though attribution is desired.

1.2 Purpose and Scope

The purpose of this publication is to guide individuals, organizations, professionals, non-professionals, and hobbyists in the implementation and configuration of a Document Management System using OpenKM on CentOS 6. The scope of this publication is limited to OpenKM 6.2 as the DMS, installed on CentOS 6 as the operating system.

1.3 Audience

The primary audience for this publication is IT professionals, system administrators, and others who are interested in implementing or planning to implement a Document Management System.

1.4 Document Structure

The remainder of this document is organized into the following sections:

  • Section 2 provides an introduction to CentOS 6, OpenKM, Document Management Systems, and terminology.
  • Section 3 provides guidelines on installing CentOS 6 with minimal features.
  • Section 4 provides guidelines on installing the applications and dependencies needed to deploy OpenKM on CentOS 6.

INTRODUCTION TO CENTOS, OPENKM AND DMS

This document provides deployment guidelines for a Document Management System using OpenKM on CentOS 6.

2.1 What is CentOS

CentOS is a free operating system distribution based upon the Linux kernel. It is derived entirely from the Red Hat Enterprise Linux (RHEL) distribution. CentOS exists to provide a free enterprise class computing platform and strives to maintain 100% binary compatibility with its upstream source, Red Hat. CentOS stands for Community Enterprise Operating System.

In July 2010 CentOS overtook Debian to become the most popular Linux distribution for web servers, with almost 30% of all Linux web servers using it. But in January 2012, it lost that position to Debian once again.

2.2 What is OpenKM

OpenKM is a Free/Libre document management system that provides a web interface for managing arbitrary files. OpenKM includes a content repository, Lucene indexing, and jBPM workflow. The OpenKM system was developed using Java technology.

2.3 What is a Document Management System

A Document Management System is a computer system used to track and store electronic documents, and it is usually capable of keeping track of the different versions modified by different users. A DMS is often viewed as a component of enterprise content management (ECM) systems and is related to digital asset management, document imaging, workflow systems, and records management systems.

2.3.1 DMS Components

Document management systems commonly provide storage, versioning, metadata, security, as well as indexing and retrieval capabilities.

Metadata: Metadata is typically stored for each document. Metadata may, for example, include the date the document was stored and the identity of the user storing it. The DMS may also extract metadata from the document automatically or prompt the user to add metadata. Some systems also use optical character recognition on scanned images, or perform text extraction on electronic documents. The resulting extracted text can be used to assist users in locating documents by identifying probable keywords or providing for full text search capability, or can be used on its own. Extracted text can also be stored as a component of metadata, stored with the image, or separately as a source for searching document collections.

Integration: Many document management systems attempt to integrate document management directly into other applications, so that users may retrieve existing documents directly from the document management system repository, make changes, and save the changed document back to the repository as a new version, all without leaving the application. Such integration is commonly available for office suites and e-mail or collaboration/groupware software. Integration often uses open standards such as ODMA, LDAP, WebDAV and SOAP to allow integration with other software and compliance with internal controls.

Capture: Capture primarily involves accepting and processing images of paper documents from scanners or multifunction printers. Optical character recognition (OCR) software is often used, whether integrated into the hardware or as stand-alone software, in order to convert digital images into machine-readable text. Optical mark recognition (OMR) software is sometimes used to extract values of check-boxes or bubbles. Capture may also involve accepting electronic documents and other computer-based files.

Indexing: Indexing tracks electronic documents. Indexing may be as simple as keeping track of unique document identifiers, but often it takes a more complex form, providing classification through the documents’ metadata or even through word indexes extracted from the documents’ contents. Indexing exists mainly to support retrieval. One area of critical importance for rapid retrieval is the creation of an index topology.

Storage: Storage holds the electronic documents. Storage of the documents often includes management of those same documents: where they are stored, for how long, migration of the documents from one storage medium to another, and eventual document destruction.

Retrieval: Retrieval fetches the electronic documents from storage. Although the notion of retrieving a particular document is simple, retrieval in the electronic context can be quite complex and powerful. Simple retrieval of individual documents can be supported by allowing the user to specify the unique document identifier, and having the system use the basic index (or a non-indexed query on its data store) to retrieve the document. More flexible retrieval allows the user to specify partial search terms involving the document identifier and/or parts of the expected metadata. This would typically return a list of documents which match the user’s search terms. Some systems provide the capability to specify a Boolean expression containing multiple keywords or example phrases expected to exist within the documents’ contents. The retrieval for this kind of query may be supported by previously built indexes or may perform more time-consuming searches through the documents’ contents to return a list of the potentially relevant documents.

Distribution: A published document for distribution has to be in a format that cannot be easily altered. As a common practice in law-regulated industries, the original master copy of a document is kept for archiving and is usually never used for distribution. If a document is to be distributed electronically in a regulatory environment, then the equipment performing the job has to be quality endorsed and validated. Similarly, quality-endorsed electronic distribution carriers have to be used. This approach applies to both of the systems between which the document is exchanged if the integrity of the document is critical.

Security: Document security is vital in many document management applications. Compliance requirements for certain documents can be quite complex depending on the type of documents. For instance, in the United States, the Health Insurance Portability and Accountability Act (HIPAA) requirements dictate that medical documents have certain security requirements. Some document management systems have a rights management module that allows an administrator to give access to documents based on type to only certain people or groups of people. Document marking at the time of printing or PDF creation is an essential element to preclude alteration or unintended use.

Workflow: Workflow is a complex process, and some document management systems have a built-in workflow module. There are different types of workflow. Usage depends on the environment to which the electronic document management system (EDMS) is applied. Manual workflow requires a user to view the document and decide whom to send it to. Rules-based workflow allows an administrator to create a rule that dictates the flow of the document through an organization: for instance, an invoice passes through an approval process and then is routed to the accounts-payable department. Dynamic rules allow for branches to be created in a workflow process. A simple example would be to enter an invoice amount and, if the amount is lower than a certain set amount, have it follow a different route through the organization. Advanced workflow mechanisms can manipulate content or signal external processes while these rules are in effect.

Collaboration: Collaboration should be inherent in an EDMS. In its basic form, a collaborative EDMS should allow documents to be retrieved and worked on by an authorized user. Access should be blocked to other users while work is being performed on the document. Other advanced forms of collaboration allow multiple users to view and modify (or mark up) a document at the same time in a collaboration session. The resulting document should be viewable in its final shape, while also storing the markups done by each individual user during the collaboration session.

Versioning: Versioning is a process by which documents are checked in or out of the document management system, allowing users to retrieve previous versions and to continue work from a selected point. Versioning is useful for documents that change over time and require updating, where it may be necessary to go back to or reference a previous copy.

Searching: Searching finds documents and folders using template attributes or full text search. Documents can be searched using various attributes and document content.

Publishing: Publishing a document involves the procedures of proofreading, peer or public reviewing, authorizing, printing, and approving. These steps ensure prudence and logical thinking. Any careless handling may result in an inaccurate document that misleads or upsets its users and readers. In law-regulated industries, some of the procedures have to be completed as evidenced by their corresponding signatures and the date(s) on which the document was signed. Refer to the ISO divisions of ICS 01.140.40 and 35.240.30 for further information. The published document should be in a format that is not easily altered without specific knowledge or tools, and yet is read-only or portable.

Reproduction: Document/image reproduction is key when thinking about implementing a system. It’s great to be able to put things in, but how are you going to get them out? An example of this is building plans: how will plans be scanned, and how will scale be retained when printed?


2.3.2 Document Control

Your documents (procedures, work instructions, policy statements, and so on) provide evidence that your documentation is under control. Failing to comply with document control requirements could result in fines, loss of business, or damage to your business reputation.

The basic requirement of document control is that you establish and document a procedure for:
  • Reviewing and approving documents prior to release
  • Reviewing, updating, and re-approving documents as needed
  • Ensuring changes and revisions are clearly identified
  • Ensuring that relevant versions of applicable documents are available at their “points of use”
  • Ensuring that documents remain legible and identifiable
  • Ensuring that external documents like customer supplied documents or supplier manuals are identified and controlled
  • Preventing “unintended” use of obsolete documents

CENTOS 6.2

3.1 CentOS 6.2 minimal install

Linux distributions let you choose which packages to install on a server. A minimal install is the most basic deployment of a Linux distribution, containing only the packages needed for the operating system to run on a machine. CentOS 6.2 provides a “Minimal” ISO that can be downloaded from the CentOS website. Download the minimal install ISO and burn it to a CD.

The first thing to note is that the CentOS minimal install is so minimal that you do not even have network connectivity. That is the first thing you have to deal with so that you can install other packages. It is easy to fix: edit the /etc/sysconfig/network-scripts/ifcfg-eth0 file to use DHCP or the static IP information of your choice, and you will be ready to install more packages. Also be sure to set ONBOOT=yes, or you will have to start networking manually each time you boot.

So the key files you will need to edit are:
  • /etc/sysconfig/network-scripts/ifcfg-eth0
  • /etc/hosts
  • /etc/resolv.conf
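
The DHCP case for the first of these files can be sketched as a small shell helper. This is only a minimal sketch, assuming a single interface named eth0 that should obtain its address by DHCP. The function name write_ifcfg_dhcp and the staging path in the comments are hypothetical and chosen for illustration; DEVICE, BOOTPROTO, and ONBOOT are the usual ifcfg keys.

```shell
#!/bin/sh
# Sketch: generate a DHCP configuration for eth0.
# write_ifcfg_dhcp is a hypothetical helper name, not a CentOS command.
write_ifcfg_dhcp() {
    target="$1"    # file to write, e.g. a staging copy of ifcfg-eth0
    cat > "$target" <<'EOF'
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
EOF
}

# Assumed usage: write to a staging file, review it, then install it as root.
#   write_ifcfg_dhcp /tmp/ifcfg-eth0
#   cp /tmp/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth0
#   service network restart
```

Writing to a staging file first keeps a typo from taking the only network interface down before the result has been reviewed.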

OPENKM 6.2

4.1 OpenKM Installation and Configuration

Download the latest OpenKM installer from:
http://www.openkm.com/en/download-english.html

# chmod +x openkm-6.2.0-community-linux-x64-installer.run
# ./openkm-6.2.0-community-linux-x64-installer.run

… and follow the wizard instructions.

After the installation is done, run the commands below to install the yum repositories and packages needed by OpenKM:

# cd /etc/yum.repos.d
# wget http://www.linux-mail.info/files/dag-clamav.repo
# wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
# wget http://ftp.jaist.ac.jp/pub/Linux/Fedora/epel/6/i386/epel-release-6-7.noarch.rpm
# rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
# rpm -K rpmforge-release-0.5.2-2.el6.rf.*.rpm
# rpm -i rpmforge-release-0.5.2-2.el6.rf.*.rpm
# rpm -K epel-release-6-7.noarch.rpm
# rpm -i epel-release-6-7.noarch.rpm
# yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel gcc gcc-c++ make autoconf libtool automake openoffice.org* gcc* giflib-devel freetype-devel ImageMagick amavisd-new spamassassin clamav clamd unzip bzip2 unrar perl-DBD-mysql -y

4.2 Install Java – Download the latest package of openjdk

# yum install java-1.7.0-openjdk-devel java-1.7.0-openjdk

4.3 Install Tesseract OCR – Tesseract is a free software Optical Character Recognition engine for various operating systems. Tesseract is considered one of the most accurate free software OCR engines currently available.

# cd /root
# yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel gcc gcc-c++ make autoconf libtool automake -y
# wget http://www.leptonica.com/source/leptonlib-1.67.tar.gz
# tar -zxvf leptonlib-1.67.tar.gz
# cd leptonlib-1.67 && ./configure && make && make install
# cd /root
# wget http://tesseract-ocr.googlecode.com/files/tesseract-3.00.tar.gz
# tar -zxvf tesseract-3.00.tar.gz
# cd tesseract-3.00
# ./configure && make && make install
# cd /usr/local/share/tessdata
# wget http://tesseract-ocr.googlecode.com/files/deu.traineddata.gz
# gunzip deu.traineddata.gz

4.4 Install libreoffice as alternative for openoffice

# yum install libreoffice*
# soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard &

4.5 Install swftools – Swftools is a collection of SWF manipulation and creation utilities released under the GPL. Cross-platform.

# wget http://www.swftools.org/swftools-0.9.1.tar.gz
# tar -zxvf swftools-0.9.1.tar.gz
# yum install gcc* automake zlib-devel libjpeg-devel giflib-devel freetype-devel
# cd swftools-0.9.1
# ./configure && make && make install
# cd ..

4.6 Install ImageMagick – ImageMagick is an open source software suite for displaying, converting, and editing raster image files. It can read and write over 100 image file formats. ImageMagick is licensed under the Apache 2.0 license.

# yum install ImageMagick
# which convert

4.7 Install ClamAV – Clam AntiVirus (ClamAV) is a free, cross-platform antivirus software tool-kit able to detect many types of malicious software, including viruses. One of its main uses is on mail servers as a server-side email virus scanner.

# yum install amavisd-new spamassassin clamav clamd unzip bzip2 unrar perl-DBD-mysql

# sa-update
# chkconfig --level 235 amavisd on
# chkconfig --level 235 clamd on
# /usr/bin/freshclam
# /etc/init.d/amavisd start
# /etc/init.d/clamd start
# mkdir /var/run/amavisd /var/spool/amavisd /var/spool/amavisd/tmp /var/spool/amavisd/db
# chown amavis /var/run/amavisd /var/spool/amavisd /var/spool/amavisd/tmp /var/spool/amavisd/db
# ln -s /var/run/clamav/clamd.sock /var/spool/amavisd/clamd.sock

4.8 Configure OpenKM.cfg

# nano <openkm installation folder>/tomcat/conf/OpenKM.cfg

Add or update the following properties:

system.ocr=/usr/local/bin/tesseract
system.openoffice.server=http://localhost:8080/converter/convert
system.imagemagick.convert=/usr/bin/convert
system.swftools.pdf2swf=/usr/local/bin/pdf2swf -T 9 -f ${fileIn} -o ${fileOut}
system.antivir=/usr/bin/clamscan
hibernate.dialect=org.hibernate.dialect.HSQLDialect
hibernate.hbm2ddl=none
application.url=http://host:8080/OpenKM/com.openkm.frontend.Main/index.jsp
system.webdav.server=on
system.webdav.fix=on
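
Before starting OpenKM, it can help to confirm that the helper programs named in OpenKM.cfg are actually installed at those paths. The following is a minimal sketch, not an OpenKM tool: the function name check_openkm_tools is hypothetical, and it assumes each property sits on one line in key=value form, with the program path as the first word of the value. It checks only the four properties above that point at executables.

```shell
#!/bin/sh
# Sketch: report whether the programs named in OpenKM.cfg exist.
# check_openkm_tools is a hypothetical helper, not an OpenKM command.
check_openkm_tools() {
    cfg="$1"
    status=0
    for key in system.ocr system.imagemagick.convert \
               system.swftools.pdf2swf system.antivir; do
        # Find the line for this key, take the text after the first '=',
        # and keep only its first word (the program path, without arguments).
        prog=$(awk -v k="$key" '
            {
                eq = index($0, "=")
                if (eq == 0) next
                name = substr($0, 1, eq - 1)
                gsub(/[ \t]/, "", name)
                if (name != k) next
                value = substr($0, eq + 1)
                sub(/^[ \t]+/, "", value)
                split(value, words, /[ \t]+/)
                print words[1]
                exit
            }' "$cfg")
        if [ -z "$prog" ]; then
            echo "$key: not set"
        elif [ -x "$prog" ]; then
            echo "$key: ok ($prog)"
        else
            echo "$key: missing ($prog)"
            status=1
        fi
    done
    return "$status"
}

# Assumed usage:
#   check_openkm_tools "<openkm installation folder>/tomcat/conf/OpenKM.cfg"
```

The function returns a non-zero status when a configured program is missing, so it can be used as a guard in a start-up script.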

4.9 Configure Server.xml and Run OpenKM

To make OpenKM accessible from other computers on your network, modify <openkm installation folder>/tomcat/conf/server.xml.

Look for:

<Connector address="127.0.0.1" connectionTimeout="20000" port="8080" protocol="HTTP/1.1" redirectPort="8443"/>

Change the Connector address to 0.0.0.0 so that OpenKM accepts connections on all interfaces, then save the changes.

# cd <openkm installation folder>/tomcat/bin
# ./catalina.sh start

Tomcat will bind to all network interfaces of the computer. Now OpenKM can be accessed from another computer using http://your-domain.com:8080/OpenKM.
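
The Connector address change can also be scripted instead of edited by hand. This is a minimal sketch under stated assumptions: the function name bind_all_interfaces is hypothetical, and it assumes the address attribute sits on the same line as the opening <Connector tag. It keeps a backup copy next to the original file.

```shell
#!/bin/sh
# Sketch: switch the Tomcat Connector from loopback to all interfaces.
# bind_all_interfaces is a hypothetical helper name.
bind_all_interfaces() {
    xml="$1"
    cp "$xml" "$xml.bak"    # keep a backup of the original file
    # Rewrite address="127.0.0.1" only on lines that open a Connector element.
    sed '/<Connector/s/address="127\.0\.0\.1"/address="0.0.0.0"/' \
        "$xml.bak" > "$xml"
}

# Assumed usage, run before starting Tomcat:
#   bind_all_interfaces "<openkm installation folder>/tomcat/conf/server.xml"
```

If the result is not what you expect, copy the .bak file back over server.xml to restore the original configuration.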

You can log into OpenKM with okmAdmin user (default password is “admin”).
 
