DATA PREPARATION AND TABULATION OF THIRD CENSUS DATA ON SSI UNITS
PROJECT EXECUTION METHODOLOGY
Design, Printing, Packing and Delivery of ICR forms
Scanning
Data Capture Using Intelligent Character Recognition (ICR)
Data Validation and Tabulation
Deliverables
The project, which covered processing of the filled-in forms
and preparation of data for the Third Census of Small Scale
Industries, was unique in nature and needed a special methodology
for execution. The primary issues addressed in this project were:
- Designing and printing a large volume of ICR forms
- Packing and delivering the forms in small lots as per
requirements
- Automated high-volume form processing using ICR technology
in limited time
- Technology to scan duplex documents with advanced features
such as image enhancement, capture of only the dynamic (filled-in)
data, support for different recognition and trainable engines,
and handling of different styles of handwriting
- Software development for automating all activities such
as scanning, data extraction, data validation and tabulation
- Data validation checks and tabulation
- Generation of hard copies and soft copies of the data
of desired quality
- Foolproof, built-in security mechanisms to ensure the
integrity of the data and prevent data leakage
- M/s CS Software Enterprise Limited (CSEL) along with its
partners executed the project as per the scope of work prescribed
in the written contract with the company and as per the guidelines
given on a day-to-day basis by the Census Cell of the Office
of the Development Commissioner (Small Scale Industries) led
by Shri M.V.S. Ranganadham, Director (Census). The procedures
adopted during the execution of the project were as follows.
- Design, Printing, Packing and Delivery of ICR forms
- The project required printing of approximately thirty lakh
ICR forms in three formats to record the data pertaining to
small-scale industrial units in both the registered and unregistered
segments. The first milestone of the project was to design
the ICR form formats to capture all the required information.
It was important at that stage to anticipate variations
in the size and complexity of each data item, so
that the field survey team would not face problems in entering
the complete information. This needed a coordinated effort
by the design team under the guidance of officers of
the DC (SSI) to design foolproof, user-friendly forms. Issues
such as the requirements of ICR technology and the thickness
of the paper were addressed at this stage.
- Once the form design was finalised, the printing of sample
ICR forms was taken up. The ICR forms were printed at a
printing press experienced in this kind of work. The nature and
quality of ICR form printing played a vital role in the project:
the entire ICR process hinged on reading the forms, and any deviation
in the printing could have caused serious problems during
data capture. Sample copies of the printed forms were
filled in with sample data, and the data extraction procedures
and printing quality of the forms were comprehensively tested
before the bulk printing job was taken up.
- Once the forms were printed, they were checked for inadvertent
mistakes, such as black patches, blots, skewing of the printed
matter, etc., before they were packed and delivered to the
Directorates of Industries located at all the State Capitals
and Union Territories across the country.
- Once the printing of the forms was completed, the packing
assignment was taken up. Fifty forms were packed in one bundle
and each bundle was wrapped in a waterproof polythene packet.
Care was taken to pack the different types of forms (Format-I,
Format-II and Format-III) in exact quantities as per
the guidelines of the DC (SSI). This spared the survey team
avoidable difficulties in the field.
- Scanning
- After the survey, the filled-in forms were received in packets
of fifty at the SISI office, Okhla, Delhi, directly
from the respective District Industries Centres in different
batches. These were handed over to CSEL under acknowledgement
by the concerned staff at the SISI office, Okhla.
- After the filled-in forms were received from the SISI, the
document packets containing approximately 50 forms were bundled
and labelled with a Batch number and a Job number. The Job number
assigned was the date on which the filled-in packets were
received, and the Batch number was the serial number of the packet
among those received on that date.
- The data capture from the forms was a two-stage process.
First, the forms were scanned at pre-set DPI settings and
the images were stored in a directory structure indexed by
Job number and Batch number. Second, the data was extracted
from the scanned images using ICR.
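
The indexing scheme described above can be sketched as follows. This is only
an illustration, not the project's actual software: the function names, the
YYYYMMDD rendering of the Job number, the zero-padded Batch number and the
.tif extension are all assumptions made for the example.

```python
from datetime import date
from pathlib import PurePosixPath


def batch_directory(root: str, received_on: date, batch_serial: int) -> PurePosixPath:
    """Return the directory that holds the images of one packet of forms."""
    if batch_serial < 1:
        raise ValueError("batch serial numbers start at 1")
    # Job number = date on which the packet was received (format is assumed).
    job_number = received_on.strftime("%Y%m%d")
    # Batch number = serial number of the packet on that date (padding is assumed).
    batch_number = f"{batch_serial:04d}"
    return PurePosixPath(root) / job_number / batch_number


def image_path(root: str, received_on: date, batch_serial: int, form_serial: int) -> PurePosixPath:
    """Return a hypothetical file path for one scanned form within its batch."""
    return batch_directory(root, received_on, batch_serial) / f"form_{form_serial:03d}.tif"
```

With a made-up receipt date, `image_path("scans", date(2003, 1, 15), 7, 12)`
gives `scans/20030115/0007/form_012.tif`, so every image can be traced back
to its packet from the path alone.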

- The scanning of the forms was taken up using high-speed
scanners such as the Fujitsu 4099D, Kodak DS 2500 and Kodak i260
to ensure that at least 1,25,000 forms were scanned on any
given day. The scanning was done using custom-developed software,
which ensured that all the forms were scanned at 150 dpi resolution
and compressed to an image size of less than 60 KB per
document.
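
The two figures quoted above imply a rough ceiling on daily image storage.
The short calculation below is a back-of-the-envelope sketch: the daily
target and the 60 KB limit come from the text, while the 700 MB capacity of
one CD is an assumption introduced here.

```python
import math

FORMS_PER_DAY = 125_000   # daily scanning target from the text (1,25,000)
MAX_IMAGE_KB = 60         # size limit per document image from the text
CD_CAPACITY_MB = 700      # assumption: capacity of a single CD


def daily_storage_mb(forms: int = FORMS_PER_DAY, kb_per_form: int = MAX_IMAGE_KB) -> float:
    """Storage for one day's images in MB, using 1 MB = 1000 KB."""
    return forms * kb_per_form / 1000


def cds_needed(storage_mb: float, cd_mb: int = CD_CAPACITY_MB) -> int:
    """Number of CDs required to hold the given volume."""
    return math.ceil(storage_mb / cd_mb)


print(daily_storage_mb())              # 7500.0
print(cds_needed(daily_storage_mb()))  # 11
```

At the target volume, with every image at the size limit, one day of
scanning comes to about 7.5 GB, or roughly eleven CDs under the assumed
capacity. Actual volumes would have varied with image size and daily output.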

- The custom-built software application checked the quality
of each image, automatically bound page one and page two of
each document into a single image file, and checked for
missing pages, if any. The scanned images
were stored in directories on the hard disk using their Batch
number and Job number. At the time of scanning, the application
compared the number of scanned images with the form count
entered when the Job and Batch numbers were created. Wherever
there was a mismatch, all the scanned images of the bundle
were deleted automatically and the operator
rescanned the bundle. At the end of each day, a backup
of all the scanned images was taken on CDs.
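
A minimal sketch of the page-binding and count check described above is
given here. It is not the project's scanning software: the function names are
hypothetical, and it assumes the scanner delivers page images in order
(front, back, front, back, ...).

```python
def bind_pages(page_images):
    """Pair consecutive page images into (page one, page two) tuples."""
    if len(page_images) % 2 != 0:
        # An odd number of images means at least one page is missing.
        raise ValueError("odd number of page images: a page is missing")
    return [(page_images[i], page_images[i + 1])
            for i in range(0, len(page_images), 2)]


def accept_batch(page_images, declared_form_count):
    """Return the bound forms, or None when the bundle must be rescanned."""
    try:
        forms = bind_pages(page_images)
    except ValueError:
        return None
    if len(forms) != declared_form_count:
        # The scanned count differs from the count entered for the batch.
        return None
    return forms
```

A `None` result stands for the case in which the images of the bundle are
discarded and the operator scans the bundle again.
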
- Data Capture Using Intelligent Character Recognition (ICR)
- Traditionally, data capture is done through manual data entry.
This age-old process is not only tedious and time-consuming but
also prone to errors. In recent years, new technologies
have been developed to capture data from handwritten forms
and printed documents. The most significant among them are
Intelligent Character Recognition (ICR) for handwritten
documents and Optical Character Recognition (OCR) for data capture
from printed documents.
- Automated data capture and forms processing, whether paper-based
or electronic, is rapidly becoming an integral and necessary
component in the government, insurance and financial sectors.
It results in savings of 50 to 75 percent in direct costs
and a significant increase in productivity in comparison with
manually processed forms.
- In order to expedite the data capture and to improve the
accuracy of the data from over 26 lakh filled-in
forms, ICR technology was used. Cardiff TELEform
8.0 software was used to extract the data from the scanned
images.
- Cardiff TELEform interprets handprint, machine print,
check boxes and bar codes from scanned images. After automatically
processing each form, TELEform highlights the illegible and
invalid entries for operator attention. Because TELEform
processes the majority of the information, entry operators
spend seconds verifying questionable data rather than minutes
manually keying entire forms. Scanned forms are automatically
identified, eliminating the need for manual sorting. After
identifying a form, Cardiff Software’s Tri-CR® recognition
technology interprets the form’s hand print (ICR), machine
print (OCR), bar code and check box (OMR) fields.
- TELEform Reader runs in unattended mode enabling
forms to be processed continuously. TELEform’s automatic form
identification process handles multi-page forms, identifying
out of sequence and missing pages. The software also improves
the quality of scanned forms by performing despeckle, half-tone
removal, character smoothing and line thickening procedures.
- Tri-CR® leverages the strengths of multiple recognition
engines to produce unprecedented accuracy for hand print and
machine print data. The software engines examine the characters.
Tri-CR then analyzes the results, balances the strengths of
the individual engines and determines the correct interpretation
of data.
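
The general idea of combining several recognition engines can be illustrated
with simple majority voting. The sketch below is NOT Cardiff's Tri-CR
algorithm, whose internals are not described here; it is a generic,
hypothetical example in which each engine returns a string of the same length
for a field and positions without sufficient agreement are flagged for review.

```python
from collections import Counter


def vote_character(candidates, min_agreement=2):
    """Return (most common guess, whether enough engines agreed on it)."""
    best, votes = Counter(candidates).most_common(1)[0]
    return best, votes >= min_agreement


def vote_field(engine_outputs, min_agreement=2):
    """Combine one string per engine into a result plus doubtful positions."""
    if len({len(text) for text in engine_outputs}) != 1:
        raise ValueError("engine outputs must be non-empty and of equal length")
    chars, doubtful = [], []
    # zip(*...) walks the outputs column by column, one character position at a time.
    for position, column in enumerate(zip(*engine_outputs)):
        char, agreed = vote_character(column, min_agreement)
        chars.append(char)
        if not agreed:
            doubtful.append(position)
    return "".join(chars), doubtful
```

For example, `vote_field(["105", "1O5", "105"])` returns `("105", [])`,
because two of the three hypothetical engines agree on the middle digit. When
all engines disagree at a position, that position appears in the second
return value so that an operator can check it.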
- TELEform Verifier highlights questionable data entries.
Three verification modes displaying characters, fields and
forms enable quick data correction. To work efficiently, Verifier
offers functions that ensure only accurate and complete data
makes it to the database. These functions include:
- Database validations
- User-defined dictionary look ups
- Numeric range tests
- Date, currency and character-specific formatting
- "Always review" and "Entry required" field designations
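
The kinds of checks listed above can be sketched in a few lines. This is an
illustration in Python, not TELEform Verifier configuration or BasicScript;
the field names, the state code list, the employee range and the date format
are all hypothetical.

```python
from datetime import datetime

# Hypothetical look-up dictionary; the real code lists are not given in the text.
STATE_CODES = {"DL", "MH", "TN"}


def is_present(value: str) -> bool:
    """'Entry required' check: the field must not be blank."""
    return value.strip() != ""


def in_numeric_range(value: str, low: int, high: int) -> bool:
    """Numeric range test; non-numeric text fails."""
    try:
        number = int(value)
    except ValueError:
        return False
    return low <= number <= high


def in_dictionary(value: str, allowed: set) -> bool:
    """User-defined dictionary look-up, ignoring case and padding."""
    return value.strip().upper() in allowed


def is_date(value: str, fmt: str = "%d/%m/%Y") -> bool:
    """Date formatting check; impossible dates fail."""
    try:
        datetime.strptime(value.strip(), fmt)
    except ValueError:
        return False
    return True


def failed_fields(record: dict) -> list:
    """Return the names of the fields that fail their check."""
    checks = {
        "unit_name": is_present(record.get("unit_name", "")),
        "state_code": in_dictionary(record.get("state_code", ""), STATE_CODES),
        "employees": in_numeric_range(record.get("employees", ""), 0, 9999),
        "survey_date": is_date(record.get("survey_date", "")),
    }
    return [name for name, passed in checks.items() if not passed]
```

A record whose `employees` value was read as `"1O"` (letter O instead of
zero) would be returned with `employees` in the failed list, which is the
kind of entry an operator would then correct against the image.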
- TELEform includes a fully integrated Visual Basic programming
language called BasicScript™ that allows validation requirements
to be customized. Using BasicScript, arithmetic comparisons, financial
calculations, calls to external applications, skip and fill
logic, and other business logic routines can be incorporated.
- The data capture from the scanned images was carried out
using Cardiff TELEform 8.0. In addition to the features described
above, the software has a built-in facility to recognize the
form type and interpret the data accordingly.
This feature helped in getting correct output even when
filled-in forms of all three formats were mixed together.
The output generated by the ICR engine was in comma-separated
values format. The data thus obtained was ported into an Oracle
database for carrying out the necessary validations.
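
The porting step can be illustrated with a small loader. The sketch below
uses Python's built-in sqlite3 module purely as a stand-in for the Oracle
database, and the table name and three-column layout are hypothetical; the
actual export layout is not given in the text.

```python
import csv
import io
import sqlite3


def load_csv(conn: sqlite3.Connection, csv_text: str) -> int:
    """Load comma-separated ICR output into a table and return the row count."""
    # csv.reader handles quoted fields, so commas inside a value are preserved.
    reader = csv.reader(io.StringIO(csv_text))
    rows = [tuple(field.strip() for field in row) for row in reader if row]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS captured "
        "(form_type TEXT, form_id TEXT, unit_name TEXT)"
    )
    # Parameter binding avoids building SQL from the captured text itself.
    conn.executemany("INSERT INTO captured VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Usage with made-up data: after `conn = sqlite3.connect(":memory:")`, calling
`load_csv(conn, 'I,0001,"Sample Unit, Delhi"\nII,0002,Another Unit\n')` loads
two rows. A row with the wrong number of fields makes `executemany` raise an
error, so malformed output is caught before validation begins.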

- The data captured through the ICR process was verified
by the data entry operators to ensure the correctness of the
data. The operators checked every field against the image. This
process ensured that the data captured by the ICR process
matched with the data on the document.
- Data Validation and Tabulation
- Once the data was captured, the defined validations were
applied through the software application. The nonconforming
records were retrieved, checked against the corresponding image
files using the custom-developed software, and corrected. The
corrected data was then ported back into the database.

- The verified and validated data was analysed and multipliers
were generated for the various formats under the guidance of the
Director (Census), DC (SSI), New Delhi. The multipliers,
together with the validated data, were used to generate the
required tables.
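
How multipliers feed into a table can be shown with a weighted sum. The text
does not say how the multipliers were derived or applied, so the sketch below
rests on assumptions: one multiplier per format, records carrying a group
key and a numeric value, and field names (`format`, `state`, `employment`)
that are invented for the example.

```python
from collections import defaultdict


def weighted_totals(records, multipliers):
    """Sum multiplier * value per group to produce estimated totals."""
    totals = defaultdict(float)
    for record in records:
        # Each record is scaled by the multiplier of the format it came from.
        weight = multipliers[record["format"]]
        totals[record["state"]] += weight * record["employment"]
    return dict(totals)
```

With made-up figures, records of 10 (format I) and 4 (format II) for group
"A" and 3 (format II) for group "B", and multipliers of 1.0 and 25.0, the
function returns `{"A": 110.0, "B": 75.0}`.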

- Deliverables
- Once the database was created, the necessary tables were
generated using the custom-built software. Where required,
hard copies of these reports were printed and delivered
to the DC (SSI). Soft copies of the reports and the database
were also copied onto suitable media such as CDs and DAT
tapes and handed over to the DC (SSI) for archival purposes.