
Review of existing pattern recognition methods. Pattern recognition and features of living perception. What are the difficulties of pattern recognition?

An image is understood as a structured description of the object or phenomenon under study, represented by a feature vector; each element of the vector is the numeric value of one of the features characterizing the corresponding object.
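As a minimal illustration (the feature names and values below are invented for this sketch), such a feature-vector description can be written directly in code:

```python
# Each object is described by a vector of numeric feature values.
# The features here (height_cm, weight_kg, hue_index) are hypothetical.
apple = [7.5, 0.18, 0.91]
pear = [9.1, 0.21, 0.34]

def dimension(obj):
    """The number of features, i.e. the dimension of the feature space."""
    return len(obj)

print(dimension(apple))  # 3
```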

The general structure of the recognition system is as follows:

The point of the recognition task is to establish whether the objects under study possess a fixed, finite set of features that allows them to be assigned to a certain class. Recognition tasks have the following characteristic features:

1. These are information tasks consisting of two stages:

a. Reducing the source data to a form convenient for recognition.

b. Recognition proper, that is, indicating that an object belongs to a certain class.

2. In these tasks, you can introduce the concept of analogy or similarity of objects and formulate the concept of proximity of objects as a basis for classifying objects into the same class or different classes.

3. In these tasks, you can operate with a set of precedents - examples, the classification of which is known and which in the form of formalized descriptions can be presented to the recognition algorithm to adjust to the task during the learning process.

4. In these tasks it is difficult to build formal theories and apply classical mathematical methods: often there is not enough information for an accurate mathematical model, or the gains from using the model and mathematical methods are not commensurate with the costs.

5. In these tasks, “bad information” is possible - information with omissions, heterogeneous, indirect, fuzzy, ambiguous, probabilistic.

It is advisable to distinguish the following types of recognition tasks:

1. Recognition task, that is, assigning a presented object according to its description to one of the given classes (supervised learning).

2. The task of automatic classification is the division of a set of objects (situations) according to their descriptions into a system of non-overlapping classes (taxonomy, cluster analysis, unsupervised learning).

3. The task of selecting an informative set of features during recognition.

4. The task of reducing the source data to a form convenient for recognition.

5. Dynamic recognition and dynamic classification - tasks 1 and 2 for dynamic objects.

6. The forecasting task: tasks of type 5 in which the decision must refer to some point in the future.
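The first task in this list (supervised recognition) can be sketched with a minimal nearest-neighbour rule; the training precedents and class names below are invented:

```python
import math

# Precedents: (feature vector, known class) pairs presented to the algorithm.
precedents = [
    ([1.0, 1.1], "A"), ([0.9, 1.0], "A"),
    ([5.0, 5.2], "B"), ([5.1, 4.9], "B"),
]

def recognize(x, precedents):
    """Assign x to the class of the nearest precedent (1-NN rule)."""
    closest = min(precedents, key=lambda p: math.dist(x, p[0]))
    return closest[1]

print(recognize([1.2, 0.9], precedents))  # "A"
```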

The concept of image.

An image (a class) is a classification grouping in a system that unites (singles out) a certain group of objects according to some criterion. Images have a characteristic property: familiarity with a finite number of phenomena from a given set makes it possible to recognize an arbitrarily large number of its representatives.


A certain set of states of a control object can also be considered as an image, and this entire set of states is characterized by the fact that in order to achieve a given goal, the same impact on the object is required. Images have characteristic objective properties in the sense that different people, trained on different observational material, for the most part classify the same objects in the same way and independently of each other.

In general, the problem of pattern recognition consists of two parts: training and recognition.

Training is carried out by showing individual objects indicating their belonging to one or another image. As a result of training, the recognition system must acquire the ability to respond with the same reactions to all objects of the same image and with different reactions to all objects of different images.

It is very important that the learning process is completed after showing only a finite number of objects, without any other prompts. The learning objects can be visual images, various phenomena of the outside world, and so on.

Training is followed by the process of recognizing new objects, which characterizes the operation of the already trained system. Automating these procedures constitutes the problem of learning pattern recognition. When a person devises the classification rules himself and then imposes them on the computer, the recognition problem is only partially solved, since the person takes on the main part of the problem (the training).

The problem of teaching pattern recognition is interesting from both an applied and a fundamental point of view. From an applied point of view, solving this problem is important primarily because it opens up the possibility of automating many processes that until now have been associated only with the activity of the living brain. The fundamental significance of the problem is related to the question of what a computer can and cannot do in principle.

When solving control problems using pattern recognition methods, the term “state” is used instead of the term “image”. State – certain forms of displaying the measured current (instantaneous) characteristics of the observed object; a set of states determines the situation.

A situation is usually called a certain set of states of a complex object, each of which is characterized by the same or similar characteristics of the object. For example, if a certain control object is considered as an object of observation, then the situation combines such states of this object in which the same control actions should be applied. If the object of observation is a game, then the situation unites all states of the game.

The choice of the initial description of objects is one of the central tasks of the problem of learning pattern recognition. If the initial description (feature space) is successfully chosen, the recognition task may turn out to be trivial. Conversely, a poorly chosen initial description can lead to either very difficult further processing of information or no solution at all.

Geometric and structural approaches.

Any image that arises as a result of observing an object during training or an exam can be represented as a vector, and therefore as a point in some feature space.

If it is asserted that shown objects can be unambiguously assigned to one of two (or several) images, then it is thereby asserted that in some space there exist two or more regions that have no common points, and that the shown objects are represented by points from these regions. Each such region can be assigned a name, that is, the name corresponding to its image.

Let us interpret the process of learning pattern recognition in terms of a geometric picture, limiting ourselves for now to the case of recognizing only two images. It is assumed that it is known in advance only that it is necessary to separate two regions in some space and that only points from these regions are shown. These areas themselves are not predetermined, that is, there is no information about the location of their boundaries or rules for determining whether a point belongs to a particular area.

During training, points randomly selected from these areas are presented, and information is provided about which area the presented points belong to. No additional information about these areas, that is, the location of their boundaries, is provided during training.

The goal of training is either to construct a surface that would separate not only the points shown during the training process, but also all other points belonging to these areas, or to construct surfaces that bound these areas so that each of them contains only points of one image. In other words, the goal of training is to construct functions from image vectors that would, for example, be positive at all points of one image and negative at all points of another image.

Due to the fact that the areas do not have common points, there is always a whole set of such separating functions, and as a result of training, one of them must be constructed. If the presented images belong not to two, but more images, then the task is to construct, using the points shown during training, a surface separating all the areas corresponding to these images from each other.

This problem can be solved, for example, by constructing a function that takes the same value over points in each of the regions, and over points from different regions the value of this function must be different.
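For two linearly separable images, one classical way to construct such a separating function is the perceptron rule. The sketch below (with invented two-dimensional data) learns f(x) = w·x + b that is positive at the points of one image and negative at the points of the other:

```python
# Perceptron sketch: learn w, b so that f(x) = w.x + b separates two
# images. It converges only if the two regions are linearly separable.

def train_perceptron(points, labels, epochs=100, lr=0.1):
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(points, labels):   # y is +1 or -1
            f = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * f <= 0:                 # misclassified -> adjust
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:                    # all training points separated
            break
    return w, b

points = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
labels = [-1, -1, 1, 1]
w, b = train_perceptron(points, labels)
f = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
print(f([0.1, 0.0]) < 0, f([1.0, 0.9]) > 0)  # True True
```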

It may seem that knowing just a few points from a region is not enough to isolate the entire region. Indeed, one can specify infinitely many different regions containing these points, and however a separating surface is constructed from them, one can always specify another region that intersects this surface and still contains the shown points.

However, it is known that the problem of approximating a function from information about it on a limited set of points, significantly narrower than the entire set on which the function is defined, is an ordinary mathematical problem of function approximation. Of course, solving such problems requires introducing certain restrictions on the class of functions under consideration, and the choice of these restrictions depends on the nature of the information that the teacher can add to the teaching process.

One such clue is the hypothesis of compactness of images.

Along with the geometric interpretation of the problem of teaching pattern recognition, there is another approach, which is called structural, or linguistic. Let's consider the linguistic approach using the example of visual image recognition.

First, a set of initial concepts is identified - typical fragments found in the image, and characteristics of the relative position of the fragments (on the left, below, inside, etc.). These initial concepts form a vocabulary that allows you to construct various logical statements, sometimes called sentences.

The task is to select from a large number of statements that could be constructed using these concepts the most significant ones for a given specific case. Next, viewing a finite and possibly small number of objects from each image, you need to construct a description of these images.

The constructed descriptions must be so complete as to resolve the question of which image a given object belongs to. When implementing a linguistic approach, two tasks arise: the task of constructing an initial dictionary, that is, a set of typical fragments, and the task of constructing description rules from elements of a given dictionary.

Within the framework of the linguistic interpretation, an analogy is drawn between the structure of images and the syntax of a language. This analogy was motivated by the possibility of applying the apparatus of mathematical linguistics; that is, the methods are syntactic in nature. The apparatus of mathematical linguistics can be applied to describing the structure of images only after the images have been segmented into their component parts, that is, after words for describing typical fragments, and methods for finding them, have been developed.

After this preliminary work of selecting the words, the proper linguistic tasks arise: the tasks of automatic grammatical parsing of descriptions for image recognition.
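A toy sketch of such a structural description (the fragments, relations, and example object are all invented) might look like this:

```python
# A structural description: typical fragments plus relations between them.
# Hypothetical image "i": a dot located above a vertical stroke.
description = [("dot", "above", "stroke")]

# Relations actually found in a segmented input object.
observed = {("dot", "above", "stroke"), ("stroke", "left_of", "serif")}

def fits(description, observed):
    """The object fits the image if every required relation is observed."""
    return all(rel in observed for rel in description)

print(fits(description, observed))  # True
```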

Compactness hypothesis.

If we assume that during the learning process, the feature space is formed based on the intended classification, then we can hope that the specification of the feature space itself specifies a property under the influence of which images in this space are easily separated. It was these hopes, as work in the field of pattern recognition developed, that stimulated the emergence of the compactness hypothesis, which states that images correspond to compact sets in the feature space.

By a compact set we mean certain clusters of points in image space, assuming that between these clusters there are rarefactions separating them. However, this hypothesis could not always be confirmed experimentally. But those tasks for which the compactness hypothesis was well fulfilled always found a simple solution, and vice versa, those tasks for which the hypothesis was not confirmed were either not solved at all, or were solved with great difficulty and the involvement of additional information.

The compactness hypothesis itself has become a sign of the possibility of satisfactorily solving recognition problems.

The formulation of the compactness hypothesis brings us close to the concept of an abstract image. If the coordinates of the space are chosen randomly, then the images in it will be distributed randomly. They will be more densely located in some parts of the space than in others.

Let us call some randomly chosen space an abstract space. In this abstract space there will almost certainly exist compact sets of points. Therefore, in accordance with the compactness hypothesis, the sets of objects that correspond to compact sets of points in an abstract space are usually called abstract images of the given space.

Training and self-learning, adaptation and training.

If one could identify some universal property that depends neither on the nature of the images nor on their representations but determines only their separability, then alongside the usual task of learning recognition, which uses information about which image each object of the training sequence belongs to, one could pose a different classification problem: the so-called unsupervised learning problem.

A task of this kind at the descriptive level can be formulated as follows: the system is simultaneously or sequentially presented with objects without any indication of their belonging to images. The input device of the system maps a set of objects onto a set of images and, using some property of image separability inherent in it in advance, produces an independent classification of these objects.

After such a self-learning process, the system should acquire the ability to recognize not only already familiar objects (objects from the training sequence), but also those that were not previously presented. The process of self-learning of a certain system is a process as a result of which this system, without prompting from a teacher, acquires the ability to develop identical reactions to images of objects of the same image and different reactions to images of different images.

The role of the teacher in this case is only to suggest to the system some objective property that is the same for all images and determines the ability to divide many objects into images.

It turns out that such an objective property is the property of compactness of images. The relative position of points in the selected space already contains information about how the set of points should be divided. This information determines the property of image separability, which is sufficient for the system to self-learn image recognition.

Most known self-learning algorithms are capable of identifying only abstract images, that is, compact sets in given spaces. The difference between them lies in the formalization of the concept of compactness. However, this does not reduce, and sometimes even increases, the value of self-learning algorithms, since often the images themselves are not defined in advance by anyone, and the task is to determine which subsets of images in a given space represent images.
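A minimal sketch of such a self-learning algorithm is k-means clustering, which, given no labels at all, finds compact clusters of points; the one-dimensional data and starting centers below are invented:

```python
def kmeans_1d(xs, centers, iters=10):
    """Group points around the nearest center, then move each center
    to the mean of its group (the classic k-means iteration)."""
    groups = [[] for _ in centers]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in xs:
            i = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[i].append(x)
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

# Two "compact sets" separated by a rarefaction, as the hypothesis assumes.
xs = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
centers, groups = kmeans_1d(xs, centers=[0.0, 10.0])
print([round(c, 2) for c in sorted(centers)])  # [1.0, 9.0]
```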

An example of such a problem statement is sociological research, when groups of people are identified based on a set of questions. In this understanding of the problem, self-learning algorithms generate previously unknown information about the existence of images in a given space that no one previously had any idea about.

In addition, the result of self-learning characterizes the suitability of the selected space for a specific recognition learning task. If the abstract images identified in the self-learning space coincide with real ones, then the space has been chosen well. The more abstract images differ from real ones, the more inconvenient the chosen space is for a specific task.

Learning is usually understood as the process of developing in some system a particular reaction to groups of identical external signals through repeated application of external corrective adjustments to the system. The mechanism that generates these adjustments almost completely determines the learning algorithm.

Self-learning differs from learning in that here the system is not given additional information about the correctness of its reactions.

Adaptation is the process of changing the parameters and structure of the system, and possibly control actions, based on current information in order to achieve a certain state of the system under initial uncertainty and changing operating conditions.

Learning is a process as a result of which the system gradually acquires the ability to respond with the necessary reactions to certain sets of external influences, and adaptation is the adjustment of the parameters and structure of the system in order to achieve the required quality of control in the face of continuous changes in external conditions.


Speech recognition systems.

Speech acts as the main means of communication between people and therefore verbal communication is considered one of the most important components of an artificial intelligence system. Speech recognition is the process of converting an acoustic signal generated at the output of a microphone or telephone into a sequence of words.

A more difficult task is speech understanding, which involves extracting the meaning of the acoustic signal. In this case, the output of the speech recognition subsystem serves as the input of the utterance understanding subsystem. Automatic speech recognition (ASR) systems are one of the areas of natural language processing technology.

Automatic speech recognition is used to automate text entry into a computer, when generating oral queries to databases or information retrieval systems, when generating verbal commands to various intelligent devices.

Basic concepts of speech recognition systems.

Speech recognition systems are characterized by many parameters.

One of the main parameters is the word recognition error rate: the ratio of the number of incorrectly recognized words to the total number of spoken words.
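As a sketch, this parameter can be computed as follows (a simplified word-by-word comparison; real evaluations first align the two word sequences, e.g. by edit distance):

```python
def word_recognition_error(reference, recognized):
    """Fraction of reference words that were not recognized correctly.
    Simplified: compares position by position, then counts any
    length difference as additional errors."""
    misses = sum(1 for r, h in zip(reference, recognized) if r != h)
    misses += abs(len(reference) - len(recognized))
    return misses / len(reference)

ref = "open the second file".split()
hyp = "open a second file".split()   # invented recognizer output
print(word_recognition_error(ref, hyp))  # 0.25
```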

Other parameters characterizing automatic speech recognition systems are:

1) dictionary size,

2) speech mode,

3) speech style,

4) subject area,

5) speaker dependence,

6) acoustic noise level,

7) quality of the input channel.

Depending on the size of the dictionary, ASR systems are divided into three groups:

With a small dictionary size (up to 100 words),

With an average vocabulary size (from 100 words to several thousand words),

With a large dictionary size (more than 10,000 words).

Speech mode characterizes the way words and phrases are pronounced. There are systems for recognizing continuous speech and systems that allow recognizing only isolated words of speech. Isolated word recognition mode requires the speaker to pause briefly between words.

According to the style of speech, ASR systems are divided into two groups: deterministic-speech systems and spontaneous-speech systems.

In deterministic speech recognition systems, the speaker reproduces speech following the grammatical rules of the language. Spontaneous speech is characterized by violations of grammatical rules and is more difficult to recognize.

Depending on the subject area, a distinction is made between ASR systems focused on highly specialized applications (for example, database access) and ASR systems with an unlimited scope of application. The latter require a large vocabulary and must provide recognition of spontaneous speech.

Many automatic speech recognition systems are speaker-dependent. This involves pre-tuning the system to the pronunciation features of a particular speaker.

The complexity of solving the problem of speech recognition is explained by the large variability of acoustic signals. This variability is due to several reasons:

Firstly, by the different implementation of phonemes - the basic units of the sound structure of a language. Variability in the implementation of phonemes is caused by the influence of neighboring sounds in the speech stream. The shades of phoneme realization determined by the sound environment are called allophones.

Secondly, by the position and characteristics of the acoustic receivers (microphones).

Thirdly, changes in the speech parameters of the same speaker, which are caused by the different emotional state of the speaker and the pace of his speech.

The figure shows the main components of the speech recognition system:

The digitized speech signal is sent to a preprocessing unit, where the features necessary for sound recognition are extracted. Sound recognition is often performed using artificial neural network models. The selected sound units are then used to search for the sequence of words that best matches the input speech signal.

Searching for a sequence of words is performed using acoustic, lexical and language models. Model parameters are determined from training data based on appropriate learning algorithms.

Speech synthesis from text. Basic Concepts

In many cases, the creation of artificial intelligence systems with elements of speech communication requires the output of messages in speech form. The figure shows a block diagram of an intelligent question-answer system with a speech interface:

Figure 1.


Let us consider the features of the empirical approach using the example of part-of-speech recognition. The task is to assign labels to the words of a sentence: noun, verb, preposition, adjective, and so on. In addition, it is necessary to determine certain additional features of nouns and verbs: for example, for a noun its number, and for a verb its form. Let us formalize the problem.

Let us represent a sentence as a sequence of words W = w1 w2 … wn, where each wi is a random variable taking one of the possible values from the language's dictionary. The sequence of labels assigned to the words of the sentence can be represented as X = x1 x2 … xn, where each xi is a random variable whose values range over the set of possible labels.

Then the task of part-of-speech recognition is to find the most probable sequence of labels x1, x2, …, xn for a given sequence of words w1, w2, …, wn. In other words, it is necessary to find the sequence of labels X* = x1 x2 … xn that maximizes the conditional probability P(x1, …, xn | w1, …, wn).

Let us rewrite the conditional probability P(X | W) in the form P(X | W) = P(X, W) / P(W). Since P(W) does not depend on X, maximizing the conditional probability over X reduces to maximizing the joint probability: X* = arg max_X P(X, W). The joint probability can be written as a product of conditional probabilities:

P(X, W) = ∏ (i = 1 … n) P(xi | x1, …, xi−1, w1, …, wi−1) · P(wi | x1, …, xi, w1, …, wi−1).

Directly searching for the maximum of this expression is difficult, since for large n the search space becomes very large. Therefore, the probabilities in this product are approximated by simpler conditional probabilities: P(xi | xi−1) and P(wi | xi). It is thereby assumed that the label xi depends only on the previous label xi−1 and not on earlier labels, and that the probability of the word wi is determined only by its current label xi. These assumptions are called Markov assumptions, and the theory of Markov models is used to solve the problem. Taking the Markov assumptions into account, we can write:

X* = arg max (x1, …, xn) ∏ (i = 1 … n) P(xi | xi−1) · P(wi | xi),

where the conditional probabilities are estimated from a set of training data.

The search for a sequence of labels X* is carried out using the Viterbi dynamic programming algorithm. The Viterbi algorithm can be considered as a variant of the search algorithm on a state graph, where the vertices correspond to word labels.
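A minimal sketch of the Viterbi algorithm for this labelling model is given below; the labels, words, and all probability values are invented for illustration:

```python
def viterbi(words, labels, p_start, p_trans, p_emit):
    """Most probable label sequence under the Markov assumptions
    P(xi | xi-1) and P(wi | xi).
    p_start[a]    = P(first label is a)
    p_trans[a][b] = P(next label is b | current label is a)
    p_emit[a][w]  = P(word w | label a)"""
    # best[i][a]: probability of the best path ending with label a at word i
    best = [{a: p_start[a] * p_emit[a].get(words[0], 0.0) for a in labels}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for a in labels:
            prob, prev = max(
                (best[i - 1][b] * p_trans[b][a] * p_emit[a].get(words[i], 0.0), b)
                for b in labels)
            best[i][a] = prob
            back[i][a] = prev
    # Backtrack from the best final label.
    last = max(labels, key=lambda a: best[-1][a])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

labels = ["NOUN", "VERB"]
p_start = {"NOUN": 0.7, "VERB": 0.3}
p_trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
p_emit = {"NOUN": {"dogs": 0.6, "bark": 0.1},
          "VERB": {"dogs": 0.1, "bark": 0.7}}
print(viterbi(["dogs", "bark"], labels, p_start, p_trans, p_emit))
# ['NOUN', 'VERB']
```

Each word adds one layer of states to the graph, so the search is linear in sentence length rather than exponential in the number of label combinations.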

It is characteristic that for any current vertex the set of child labels is always the same. Moreover, for each child vertex, the sets of parent vertices also coincide. This is explained by the fact that transitions are carried out on the state graph taking into account all possible combinations of labels. Markov's assumptions provide a significant simplification of the problem of recognizing parts of speech while maintaining high accuracy in assigning labels to words.

Thus, with 200 labels, the assignment accuracy is approximately 97%. For a long time, empirical parsing was performed using stochastic context-free grammars. However, these have a significant drawback: different grammatical parses can be assigned the same probability. This occurs because the probability of a parse is represented as the product of the probabilities of the rules involved in it, so if different parses use rules with equal probabilities, the indicated problem arises. The best results are obtained by grammars that take the vocabulary of the language into account.

In this case, the rules include the necessary lexical information, which provides different probability values for the same rule in different lexical environments. Empirical parsing is more similar to pattern recognition than to traditional parsing in its classical sense.

Comparative studies have shown that the accuracy of empirical parsing in natural language applications is higher than that of traditional parsing.

Methods of automatic image recognition and their implementation in optical character recognition systems (OCR systems) are one of the most advanced artificial intelligence technologies. Russian scientists occupy leading positions in the world in the development of this technology.

An OCR system is understood as a system for automatic pattern recognition using special programs for images of printed or handwritten text characters (for example, entered into a computer via a scanner) and converting it into a format suitable for processing by word processors, text editors, etc.

The abbreviation OCR sometimes stands for Optical Character Reader - a device for optical character recognition or automatic text reading. Currently, such devices in industrial use process up to 100 thousand documents per day.

Industrial use involves the entry of documents of good and medium quality - this is the processing of census forms, tax returns, etc.

Let us list the features of the subject area that are significant from the point of view of OCR systems:

  • font and size variety of symbols;
  • distortions in the images of symbols (breaks in the images of symbols);
  • distortions during scanning;
  • foreign inclusions in images;
  • a combination of text fragments in different languages;
  • a wide variety of character classes that can only be recognized with additional contextual information.

Automatic reading of printed and handwritten texts is a special case of automatic visual perception of complex images. Numerous studies have shown that a complete solution of this task requires intelligent recognition, that is, “recognition with understanding.”

There are three principles on which all OCR systems are based.

  • 1. The principle of image integrity. The object under study always has significant parts between which there are relationships. The results of local operations with parts of the image are interpreted only together in the process of interpreting integral fragments and the entire image as a whole.
  • 2. The principle of purposefulness. Recognition is the purposeful process of making and testing hypotheses (finding what is expected of an object).
  • 3. The principle of adaptability. The recognition system must be capable of self-learning.

Leading Russian OCR systems: FineReader; FineReader Manuscript; FormReader; CuneiForm (Cognitive Technologies); Cognitive Forms (Cognitive Technologies).

The FineReader system is produced by ABBYY, which was founded in 1989. ABBYY's development proceeds in two directions: computer vision and applied linguistics. The strategic direction of its research and development is the natural-language aspect of technologies in computer vision, artificial intelligence, and applied linguistics.

CuneiForm GOLD for Windows is the world's first self-learning intelligent OCR system; it uses adaptive text recognition technology and supports multiple languages. For each language, a dictionary is supplied for contextual checking and for improving the quality of recognition results. The system recognizes any printed or typewritten typefaces and printer fonts, with the exception of decorative and handwritten ones, as well as very low-quality texts.

Characteristics of pattern recognition systems. Among OCR technologies, special technologies for solving certain classes of automatic pattern recognition problems are of great importance:

  • search for people by photos;
  • search for mineral deposits and weather forecasting based on aerial photography and satellite images in various ranges of light radiation;
  • compiling geographical maps from the initial information used in the previous task;
  • analysis of fingerprints and iris patterns in forensics, security and medical systems.

At the stage of preparing and processing information, especially during computerization of an enterprise and automation of accounting, the task arises of entering a large amount of text and graphic information into a PC. The main devices for entering graphic information are: a scanner, a fax modem and, less commonly, a digital camera. In addition, using optical text recognition programs, you can also enter (digitize) text information into a computer. Modern software and hardware systems make it possible to automate the entry of large amounts of information into a computer, using, for example, a network scanner and parallel text recognition on several computers simultaneously.

Most OCR programs work with raster images received via a fax modem, scanner, digital camera, or other device. At the first stage, the OCR system must divide the page into blocks of text, based on the features of right and left alignment and the presence of several columns. The recognized block is then split into lines. Despite its apparent simplicity, this is not an obvious task, since in practice distortion of the image of the page or of its fragments is inevitable. Even a slight tilt causes the left edge of one line to fall below the right edge of the next, especially with tight line spacing. As a result, the problem arises of determining the line to which a given image fragment belongs, for example, for letters…

The lines are then divided into continuous image areas corresponding to individual characters; the recognition algorithm makes assumptions about the correspondence of these areas to characters, and each character is then recognized, as a result of which the page is reconstructed in text characters, usually in a specified format. OCR systems can achieve the best recognition accuracy, over 99.9%, for clean images composed of regular fonts. At first glance this accuracy seems ideal, but the error rate is still depressing: if a page contains approximately 1,500 characters, then even at a 99.9% success rate there are one or two errors per page. In such cases the dictionary-checking method is used: if a word is not in the system's dictionary, the system tries to find a similar one according to special rules. But this still does not correct 100% of the errors and requires human verification of the results.
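The dictionary-checking step can be sketched as follows: if a recognized word is absent from the dictionary, take the dictionary entry at the smallest edit (Levenshtein) distance. The dictionary and the misrecognized word below are invented:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word, dictionary):
    """Return word if known, otherwise the closest dictionary entry."""
    if word in dictionary:
        return word
    return min(dictionary, key=lambda w: edit_distance(word, w))

dictionary = ["recognition", "pattern", "system"]
print(correct("recogn1tion", dictionary))  # "recognition"
```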

Texts found in real life are usually far from perfect, and the percentage of recognition errors for "dirty" texts is often unacceptably high. Dirty images are the most obvious problem, because even small blemishes can obscure the defining parts of a character or transform one character into another. Another problem is inaccurate scanning due to the "human factor": the operator at the scanner is simply unable to smooth out each scanned page and align it precisely with the edges of the scanner. If the document has been photocopied, breaks and merges of characters often occur. Any of these effects can cause the system to err, because some OCR systems assume that a continuous image area must be a single character. A page that is misaligned or skewed produces slightly distorted character images that can confuse the OCR system.

The OCR software usually works with a large bitmap image of the page received from the scanner. Standard-resolution images are obtained by scanning at 300 dpi. An image of an A4 sheet at this resolution occupies about 1 MB of memory.
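The roughly 1 MB figure corresponds to a one-bit (black-and-white) A4 raster at 300 dpi, the resolution usually recommended for OCR; a quick check:

```python
# Checking the ~1 MB figure: an A4 page (210 x 297 mm) scanned at
# 300 dpi as a 1-bit black-and-white raster.

DPI = 300
MM_PER_INCH = 25.4
width_px  = round(210 / MM_PER_INCH * DPI)   # ~2480 px
height_px = round(297 / MM_PER_INCH * DPI)   # ~3508 px
size_mb = width_px * height_px / 8 / 2**20   # 1 bit per pixel
print(f"{width_px} x {height_px} px, {size_mb:.2f} MB")
```

A grayscale or color scan at the same resolution would be 8 to 24 times larger, which is one reason OCR systems often binarize the image early.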

The main purpose of OCR systems is to analyze raster information (the scanned image) and assign the corresponding character to each fragment of the image. After completing the recognition process, OCR systems must be able to preserve the formatting of the source documents: assign paragraph attributes in the right places, preserve tables, graphics, etc. Modern recognition programs support all common text, graphic and spreadsheet formats, as well as HTML and PDF.

Working with OCR systems should not, as a rule, cause any particular difficulties. Most of these systems have a simple automatic "scan & read" mode, and they also support recognition of images from files. However, to achieve the best results possible for a given system, it is advisable (and often essential) to first configure it manually for a specific type of text, form layout and paper quality.

When working with an OCR system, it is very important to select the recognition language and the type of material to be recognized (typewriter, fax, dot-matrix printer, newspaper, etc.); the intuitive clarity of the user interface also matters. When recognizing texts that use several languages, the effectiveness of recognition depends on the ability of the OCR system to form groups of languages. Some systems already provide ready-made combinations for the most commonly used pairs, such as Russian and English.

At the moment there are a huge number of programs that support text recognition as one of their capabilities. The leader in this area is the FineReader system. The latest version of the program (6.0) now includes tools for developing new systems based on FineReader 6.0 technology. The FineReader 6.0 family includes FineReader 6.0 Professional, FineReader 6.0 Corporate Edition, FineReader Scripting Edition 6.0 and FineReader Engine 6.0. The FineReader 6.0 system, in addition to supporting a huge number of output formats, including PDF, can directly recognize PDF files. The new Intelligent Background Filtering technology filters out information about the texture of the document and background noise in the image: a gray or colored background is sometimes used to highlight text in a document. This does not prevent a person from reading, but conventional text recognition algorithms experience serious difficulties with letters placed on top of such a background. FineReader can identify areas containing such text, separating the text from the background of the document by finding points smaller than a certain size and removing them. The contours of the letters are preserved, so that background points located close to these contours do not introduce interference that could degrade the quality of recognition.

Using the capabilities of modern layout programs, designers often create complex-shaped objects, such as wrapping multi-column text around a non-rectangular image. The FineReader 6.0 system supports the recognition of such objects and their saving in MS Word files. Now documents with complex layout will be accurately reproduced in this text editor. Even tables are recognized with maximum accuracy, while maintaining full editing capabilities.

The ABBYY FormReader system is one of ABBYY's recognition programs, based on the ABBYY FineReader Engine. The program is designed to recognize and process forms that can be filled out by hand. ABBYY FormReader handles forms with a fixed layout just as well as forms whose structure can change; recognition uses the new ABBYY FlexiForm technology.

Leading software manufacturers have licensed this Russian information technology for use with their products. The popular software packages Corel Draw (Corel Corporation), FaxLine/OCR & Business Card Wizard (Inzer Corporation) and many others have the CuneiForm OCR library built into them. This program became the first OCR system in Russia to receive the MS Windows Compatible logo.

The Readiris Pro 7 system is a professional text recognition program. According to its makers, this OCR system differs from its analogues in the highest accuracy of converting ordinary (everyday) printed documents, such as letters, faxes, magazine articles and newspaper clippings, into editable objects (including PDF files). The main advantages of the program are: the ability to recognize, more or less accurately, images compressed "to the maximum" (with maximum loss of quality) by the JPEG method; support for digital cameras and automatic detection of page orientation; and support for up to 92 languages (including Russian).

The OmniPage 11 system is a product of ScanSoft. A limited version of this program (OmniPage 11 Limited Edition, OmniPage Lite) is usually bundled with new scanners (in Europe and the USA). The developers claim that their program recognizes printed documents with almost 100% accuracy, restoring their formatting, including columns, tables, hyphenation (including hyphenation of parts of words), headings, chapter titles, captions, page numbers, footnotes, paragraphs, numbered lists, indents, graphs and pictures. It can save to Microsoft Office, PDF and 20 other formats, recognize from PDF files and edit in that format. The artificial intelligence system automatically detects and corrects errors after the first manual correction. The new, specially developed "Despeckle" software module allows recognition of documents of degraded quality (faxes, copies, copies of copies, etc.). The advantages of the program include the ability to recognize colored text and to make corrections by voice. A version of OmniPage also exists for Macintosh computers.

  • See: Bashmakov A. I., Bashmakov I. A. Intelligent Information Technologies.
An image (class) is a classification grouping in a classification system that unites (singles out) a certain group of objects according to some criterion.

The imaginative perception of the world is one of the mysterious properties of the living brain, which allows one to make sense of the endless flow of perceived information and maintain orientation in the ocean of disparate data about the outside world. When perceiving the external world, we always classify perceived sensations, that is, we divide them into groups of similar but not identical phenomena. For example, despite significant differences, one group includes all letters A written in different handwritings, or all sounds corresponding to the same note played in any octave and on any instrument; likewise, an operator controlling a technical object responds with the same reaction to a whole set of states of that object. It is characteristic that, to form a concept of a group of perceptions of a certain class, it is enough to become familiar with a small number of its representatives. A child may be shown a letter only once, after which he can find this letter in a text written in different fonts, or recognize it even when it is deliberately distorted. This property of the brain allows us to formulate such a concept as an image.

Images have a characteristic property: familiarity with a finite number of phenomena from a set makes it possible to recognize an arbitrarily large number of its representatives. Examples of images are: a river, the sea, a liquid, the music of Tchaikovsky, the poetry of Mayakovsky, etc. A certain set of states of a control object can also be considered an image, the whole set being characterized by the fact that achieving a given goal requires the same action on the object. Images also have an objective property, in the sense that different people, trained on different observational material, for the most part classify the same objects in the same way and independently of each other. It is this objectivity of images that allows people all over the world to understand one another.

The ability to perceive the external world in the form of images allows one to recognize with a certain reliability an infinite number of objects based on familiarization with a finite number of them, and the objective nature of the main property of images allows one to model the process of their recognition. Being a reflection of objective reality, the concept of an image is as objective as reality itself, and therefore can itself be an object of special study.

In the literature devoted to the problem of learning pattern recognition (LPR), the concept of a class is often introduced instead of the concept of an image.

The problem of learning pattern recognition (LPR)

One of the most interesting properties of the human brain is its ability to respond to an infinite set of states of the external environment with a finite number of reactions. Perhaps it was this property that allowed man to achieve the highest form of existence of living matter, expressed in the ability to think, i.e., to actively reflect the objective world in the form of images, concepts, judgments, etc. It is therefore no accident that the problem of LPR arose in the study of the physiological properties of the brain.

Let us consider an example of problems from the field of LPR.


Fig. 3.1.

Twelve images are presented here, and the task is to select features that make it possible to distinguish the left triad of pictures from the right one. Solving such problems requires modeling logical thinking in full.

In general, the problem of pattern recognition consists of two parts: training and recognition. Training is carried out by showing individual objects and indicating which image each belongs to. As a result of training, the recognition system must acquire the ability to respond with identical reactions to all objects of the same image and with different reactions to objects of different images. It is very important that the learning process be completed solely by showing a finite number of objects, without any other prompts. The training objects can be pictures or other visual images (letters), or various phenomena of the external world, such as sounds, bodily states during medical diagnosis, the state of a technical object in a control system, etc. It is important that during training only the objects themselves and their membership in an image are indicated. Training is followed by the recognition of new objects, which characterizes the actions of the already trained system. The automation of these procedures constitutes the problem of learning pattern recognition. When a person devises the classification rule himself and then imposes it on the machine, the recognition problem is only partially solved, since the person takes on the principal part of the problem (the training).

The problem of learning pattern recognition is interesting from both an applied and a fundamental point of view. From an applied point of view, solving this problem is important primarily because it opens up the possibility of automating many processes that until now have been associated only with the activity of the living brain. The fundamental significance of the problem is closely related to a question that arises ever more frequently with the development of ideas in cybernetics: what can a machine do, and what can it fundamentally not do? To what extent can the capabilities of a machine approach those of the living brain? In particular, can a machine acquire the human ability to perform certain actions depending on situations arising in the environment? So far it has only become clear that if a person can first recognize his own skill and then describe it, that is, indicate why he performs actions in response to each state of the external environment, or by what rule he combines individual objects into images, then such a skill can be transferred to a machine without fundamental difficulty. If a person possesses a skill but cannot explain it, then there is only one way to transfer it to a machine: teaching by examples.

The range of problems that can be solved by recognition systems is extremely wide. It includes not only the recognition of visual and auditory images, but also the recognition of complex processes and phenomena arising, for example, when the head of an enterprise chooses appropriate actions, or when optimal management is chosen for technological, economic, transport or military operations. In each of these tasks, certain phenomena, processes and states of the external world are analyzed; these are referred to below as objects of observation. Before one can analyze any object, some ordered information about it must be obtained. Such information constitutes a characterization of the object, its mapping onto the set of perceiving organs of the recognition system.

But each object of observation can affect us differently, depending on the conditions of perception. For example, any letter, even written in the same way, can, in principle, be displaced in any way relative to the perceiving organs. In addition, objects of the same image can be quite different from each other and, naturally, have different effects on the perceiving organs.

Each mapping of an object onto the perceptive organs of the recognition system, regardless of its position relative to these organs, is usually called an image of the object, and sets of such images, united by some general properties, represent images.

When solving control problems using pattern recognition methods, the term "state" is used instead of the term "image". A state is a certain form of representation of the measured current (or instantaneous) characteristics of the observed object. A set of states defines a situation. The concept of a "situation" is analogous to the concept of an "image", but the analogy is not complete, since not every image can be called a situation, although every situation can be called an image.

A situation is usually called a certain set of states of a complex object, each of which is characterized by the same or similar characteristics of the object. For example, if a certain control object is considered as an object of observation, then the situation combines such states of this object in which the same control actions should be applied. If the object of observation is a war game, then the situation combines all game states that require, for example, a powerful tank strike with air support.

The choice of the initial description of objects is one of the central tasks of the LPR problem. If the initial description (feature space) is well chosen, the recognition task may turn out to be trivial; conversely, a badly chosen initial description can lead either to very complex further processing of the information or to no solution at all. For example, if the problem is to recognize objects that differ in color, and signals from weight sensors are chosen as the initial description, then the recognition problem cannot be solved in principle.

The general structure of the recognition system and the stages in the process of its development are shown in Fig. 4.

Fig. 4. Structure of the recognition system

Recognition tasks have the following characteristic features.

These are information tasks consisting of two stages:
- transformation of the source data to a form convenient for recognition;
- recognition itself (indicating that an object belongs to a certain class).

In these problems, you can introduce the concept of analogy or similarity of objects and formulate rules on the basis of which an object is included in the same class or in different classes.

In these tasks, one can operate with a set of precedents, examples, the classification of which is known and which, in the form of formalized descriptions, can be presented to the recognition algorithm for adjustment to the task during the learning process.

For these problems it is difficult to build formal theories and apply classical mathematical methods (often information for an accurate mathematical model is not available or the benefits from using the model and mathematical methods are not commensurate with the costs).

The following types of recognition tasks are distinguished:
- the recognition task: assigning a presented object, according to its description, to one of the given classes (supervised learning);
- the task of automatic classification: dividing a set of objects, situations or phenomena, according to their descriptions, into a system of non-overlapping classes (taxonomy, cluster analysis, self-learning);
- the task of selecting an informative set of features for recognition;
- the task of bringing the source data to a form convenient for recognition;
- dynamic recognition and dynamic classification: tasks 1 and 2 for dynamic objects;
- the forecasting task: the previous type, in which the decision must refer to some point in the future.

Conclusion

Pattern recognition (or, as is often said, the recognition of objects, signals, situations, phenomena or processes) is the most common task that a person has to solve almost every second from the first to the last day of his existence. To do this he uses the enormous resources of his brain, which we estimate by such an indicator as the number of neurons, on the order of 10^10.

Without even bothering with examples, you can notice that similar actions are observed in biology, in living nature, and sometimes even in inanimate nature. In addition, recognition is constantly encountered in technology. And if this is so, then, obviously, the recognition mechanism should be considered comprehensive.

From a more general point of view, it can be argued, and this is quite obvious, that in everyday activities a person is constantly faced with tasks related to making decisions due to a constantly changing environment. Participating in this process are: sensory organs, with the help of which a person perceives information from the outside; the central nervous system, which selects, processes information and makes decisions; motor organs that implement the decision made. But the basis for solving these problems is, as is easy to see, pattern recognition.

In their practice, people solve various problems of classifying and recognizing objects, phenomena and situations: they instantly recognize one another, read printed and handwritten texts at high speed, drive cars accurately in dense traffic, reject defective parts on a conveyor belt, decipher codes and ancient scripts, and so on.

Computations in networks of formal neurons are in many ways similar to information processing in the brain. In the last decade, neurocomputing has gained extreme popularity in the West, where it has already become an engineering discipline closely related to the production of commercial products. Every year dozens of books are published devoted to the practical aspects of neurocomputing. Work is intensively underway to create a new analog element base for neurocomputing.

In Russia, where the structure of science has turned out to be "frozen" owing to a general decline in the tone of scientific research, the opinion persists that traditional mathematical methods are in principle sufficient to solve any pattern recognition problem, and neurocomputing is perceived as an excess and a tribute to a short-lived fashion. However, against the background of the numerous practical successes of neurotechnologies, claims that any specific problem can in principle be solved without them look somewhat scholastic. Since neurocomputing is actually proving its competitiveness, it is wiser to take a closer look at the phenomenon. Do we risk, with our skepticism, overlooking the beginning of a new stage in the computer revolution? Will Russian computer science lag behind the world, this time for good, in this rapidly developing and strategically important industry?

Prospects for the near future. The main feature that distinguishes neurocomputers from modern computers and ensures the future of this direction, according to the author, is the ability to solve informal problems for which, for one reason or another, solution algorithms do not yet exist. Neurocomputers offer a relatively simple technology for generating algorithms through learning. This is their main advantage, their “mission” in the computer world.

The ability to generate algorithms is especially useful for pattern recognition problems, where it is often difficult to identify the significant features a priori. That is why neurocomputing has become relevant precisely now, in the heyday of multimedia, when the development of the global Internet demands new technologies closely related to image recognition. However, first things first.

One of the main problems in the development and application of artificial intelligence remains the problem of recognizing audio and visual images. However, the Internet and developed communication channels already make it possible to create systems that solve this problem using social networks, ready to help robots 24 hours a day.

The profession of an engineer for image recognition systems based on social networks will be in demand in the near future and until AI systems are able to pass the Turing test themselves.

Extrapolating from the exponential growth of technology over several decades, futurist Raymond Kurzweil has suggested that machines capable of passing the Turing test will not be produced until 2029 at the earliest.

However, AI systems cannot wait that long - all other technologies are already ready to find their application in medicine, biology, security systems, etc. Their eyes and ears will be millions of people around the world, ready to recognize a photograph of a terrorist, an inscription on a bottle of medicine, or words for help.

The audience of social networks is growing at a gigantic pace. According to ComScore research, in May 2009, Facebook alone had 70.28 million users in the United States. And this is almost twice as high as the same figure for May 2008.

The engineer’s job will be to organize the process of transmitting unrecognized visual or audio images to users in the form of MMS, pop-ups on websites, CAPTCHA symbols on forms on blogs, etc., verifying the received data and sending the recognized word or image back to the AI ​​system.

Annotation: We want to arrive at an understanding of the phenomenon of thinking, starting from the tasks of behavior and perception, that is, from the tasks for which the brain arose and evolved. In previous lectures we spoke about behavior. Now let us see what the task of perception contributes to understanding the phenomenon of thinking. We will consider some principles of "intelligent" perception, illustrated by the example of solving the problem of automatically reading handwritten characters. The practical orientation did not lead, as often happens, to oversimplification and dilution of the problem of perception. On the contrary, to obtain a workable solution it was necessary to introduce "intelligent" components oriented toward recognition "with understanding".

Pattern recognition

From the very beginning of the development of cybernetics, machine perception of images was most often chosen for the study and modeling of intelligence and, in particular, of such obvious components of thinking as building a system of generalized knowledge about the environment and using this knowledge in the process of decision making. The perception of visual information seemed the most convenient for modeling and at the same time the most practically significant.

It was clear from the outset that a full solution of the problem of machine visual perception requires "intelligent" recognition, or recognition "with understanding". Attempts were even made to reduce thinking to perception, simply placing an identity sign between them. We will see later that thinking and perception are inextricably linked, but they are far from the same thing. Studies of living perception (primarily visual) are therefore certainly useful for understanding the thinking process, but by themselves they are far from solving the problem as a whole. At the same time, the practical orientation of work in the field of automatic analysis of visual information and the desire for technical feasibility led to a serious transformation of the problem. Researchers were all but forced to simplify their treatment of the perception process, reducing it to the classification of simple, separately considered objects according to their features. This direction became known as "pattern recognition".

Pattern recognition was most often not included in the field of "artificial intelligence" (AI), since, unlike AI problems, image recognition developed a well-elaborated mathematical apparatus, and for not-too-complex objects it proved possible to build practically working recognition (classification) systems. As a result, traditional pattern recognition, on the one hand, does not solve the problem of machine analysis of complex images and, on the other, is not a serious tool for modeling intelligence. Let us consider the related issues in more detail.

For any recognition, standards or models of classes of recognized objects are needed. Classification of recognition methods is possible according to the types of standards used or, which is almost the same, according to the method of representing objects at the input of the recognition system. Most image recognition systems typically use raster, feature, or structure methods.

The raster approach corresponds to standards that are images or some kind of image preparations. During recognition, the input image presented in the form of a dot raster is compared point to point with all the reference ones and it is determined which of the standards the image matches better, for example, has more common points. The input and reference images must be the same size and orientation. For example, in the so-called multifont-OCR (multifont printed text recognizers), this is achieved by constructing different standards not only for different fonts, but also for different character sizes (points) within the same font. Recognition of handwritten characters in this way is impossible due to their too great variability in shape, size and orientation.
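The point-to-point comparison described above can be sketched in a few lines: count matching pixels between the input raster and each reference bitmap of the same size, and return the best-scoring label. The 3x3 glyphs below are invented for illustration; real systems use full-size character rasters per font and point size.

```python
# A minimal sketch of raster (template) matching: the input glyph is
# compared point-to-point with every same-size reference bitmap, and
# the reference with the most matching pixels wins.

def match_score(image, template):
    return sum(p == q
               for row_i, row_t in zip(image, template)
               for p, q in zip(row_i, row_t))

def classify_raster(image, templates):
    """templates: dict mapping a label to a reference bitmap."""
    return max(templates, key=lambda label: match_score(image, templates[label]))

templates = {
    "I": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
}
noisy_I = [[0, 1, 0],
           [0, 1, 0],
           [0, 1, 1]]          # one flipped pixel
print(classify_raster(noisy_I, templates))  # I
```

The fragility the text describes is visible here: the method only works because input and templates share size and orientation, which is exactly what handwriting does not guarantee.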

A variant of raster recognition is also possible in which the input image is reduced to a standard size and orientation. In this case, recognition of handwritten characters by the raster method becomes possible after clustering each recognized class and creating a separate raster standard for each cluster.

In general, obtaining invariance with respect to the size, shape and orientation of objects recognized by the raster method is a complex and often insoluble problem. Another problem arises from the need to isolate from the image the fragment that relates to a separate object. This problem is common to all classical pattern recognition methods.

In the vast majority of recognition systems and, in particular, in existing omnifont optical reading systems, the main method is the feature method. In the feature-based approach, standards are constructed from features identified in the image, and the image at the input of the recognition system is represented by a feature vector. Anything can serve as a feature: any characteristic of the recognized objects. Features must be invariant to the orientation, size and shape variations of the objects. It is also desirable that feature vectors belonging to different objects of the same class fall into a convex, compact region of the feature space. The feature space must be fixed and identical for all recognized objects. The alphabet of features is invented by the system developer, and the quality of recognition largely depends on how well it is invented; there is no general method for automatically constructing an optimal alphabet of features.

Recognition consists of first obtaining the complete feature vector for an individual recognizable object selected in the image, and only then determining which of the standards this vector corresponds to. Standards are most often constructed as statistical or geometric objects. In the first case, training may consist, for example, of building a matrix of the frequencies of occurrence of each feature in each class of objects, and recognition of determining the probabilities that the feature vector belongs to each of the standards.
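The statistical variant can be sketched as follows: training tallies how often each binary feature occurs in each class, and recognition picks the class whose tallied frequencies best fit the input feature vector (a naive-Bayes-style product of per-feature probabilities, with simple add-one smoothing). The data and function names are illustrative, not taken from any particular system.

```python
# A hedged sketch of a statistical standard: a per-class matrix of
# feature occurrence frequencies, used to score an input vector by
# a product of per-feature probabilities (computed in log space).

import math

def train(samples):
    """samples: list of (binary feature vector, class label)."""
    counts, totals = {}, {}
    for features, label in samples:
        totals[label] = totals.get(label, 0) + 1
        tally = counts.setdefault(label, [0] * len(features))
        for i, f in enumerate(features):
            tally[i] += f
    return counts, totals

def classify(features, counts, totals):
    best, best_score = None, -math.inf
    for label in counts:
        score = math.log(totals[label])
        for i, f in enumerate(features):
            # add-one smoothed probability that feature i is 1 in this class
            p = (counts[label][i] + 1) / (totals[label] + 2)
            score += math.log(p if f else 1 - p)
        if score > best_score:
            best, best_score = label, score
    return best

samples = [([1, 0, 1], "A"), ([1, 1, 1], "A"),
           ([0, 1, 0], "B"), ([0, 0, 0], "B")]
model = train(samples)
print(classify([1, 0, 1], *model))  # A
```

The smoothing step matters: without it, a feature never seen in a class would give a zero probability and veto that class outright.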

In the geometric approach, the result of training is most often the division of the feature space into areas corresponding to different classes of recognized objects, and recognition consists of determining which of these areas the input feature vector corresponding to the recognized object falls into. Difficulties in assigning an input feature vector to any area may arise if areas intersect, and also if the areas corresponding to individual recognized classes are not convex and are located in the feature space in such a way that the recognized class is not separated from other classes by a single hyperplane. These problems are most often solved heuristically, for example, by calculating and comparing distances (not necessarily Euclidean) in the feature space from the examined object to the centers of gravity of subsets of the training sample corresponding to different classes. More radical measures are also possible, for example, changing the alphabet of features or clustering the training set, or both at the same time.
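The distance-to-centre-of-gravity heuristic mentioned above amounts to a nearest-centroid classifier: compute the centroid of each class's training vectors and assign an input to the class with the nearest centroid. The toy two-class data below is invented for illustration.

```python
# A minimal sketch of the geometric heuristic: assign an input
# feature vector to the class whose training-set centre of gravity
# (centroid) is nearest in feature space (squared Euclidean distance).

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_centroid(x, training):
    """training: dict mapping class label -> list of feature vectors."""
    centroids = {label: centroid(vs) for label, vs in training.items()}
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

training = {
    "circle": [[1.0, 0.9], [0.9, 1.1]],
    "square": [[4.0, 4.2], [4.1, 3.9]],
}
print(nearest_centroid([1.2, 1.0], training))  # circle
```

As the text notes, this heuristic breaks down when class regions are non-convex or interleaved; then one must change the feature alphabet or cluster the training set into several centroids per class.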

The structural approach corresponds to standard descriptions constructed in terms of the structural parts of objects and spatial relationships between them. Structural elements are highlighted, as a rule, on the contour or “skeleton” of the object. Most often, a structural description can be represented by a graph that includes structural elements and the relationships between them. During recognition, a structural description of the input object is constructed. This description is compared with all structural standards, for example, graph isomorphism is found.

Raster and structural methods are sometimes reduced to the feature approach, by treating image points as features in the first case and structural elements and the relationships between them in the second. Let us note at once that there is a very important fundamental difference between these methods: the raster method has the property of integrity; the structural method may have the property of integrity; the feature method does not have the property of integrity.

What is integrity, and what role does it play in perception?

Classical pattern recognition is usually organized as a sequential process unfolding "from the bottom up" (from image to understanding), with no control of perception from the upper, conceptual levels. The recognition stage is preceded by a stage in which an a priori description of the input image is obtained. The operations that select the elements of this description, such as features or structural elements, are performed locally on the image, and parts of the image receive independent interpretations; that is, there is no holistic perception. In general this can lead to errors: a fragment of the image considered in isolation can often be interpreted quite differently depending on the hypothesis of perception, i.e., on what holistic object one expects to see.

Secondly, traditional approaches are focused on recognizing (classifying) objects considered individually. The recognition stage proper must be preceded by a stage of segmenting (partitioning) the image into parts corresponding to the images of the individual objects being recognized. A priori segmentation methods typically exploit specific properties of the input image, and there is no general solution to the pre-segmentation problem. Except in the simplest cases, the separation criterion cannot be formulated in terms of local properties of the image itself, i.e., before it is recognized.

Printed, or even handwritten, text is not the most difficult case, but even for such images, separating lines, words, and individual characters within words can be a serious problem. Practical solutions often rely on enumerating segmentation variants, which is nothing like what the human or animal brain does during holistic, goal-directed visual perception. Recall Sechenov's words: "We do not hear and see, but listen and look." Such active perception requires holistic representations of objects at all levels, from individual parts to complete scenes, and the interpretation of parts only as parts of the whole.

Thus, the disadvantages of most traditional approaches, and above all of the feature approach, are the lack of integrity and purposefulness of perception and the sequential, unidirectional "bottom-up" organization of the process, from image to "understanding."

Recognition is also possible using artificial, or formal, recognition neural networks (RNNs), which are shrouded in an almost mystical fog; sometimes they are even regarded as some kind of analogue of the brain. Recently, texts simply say "neural networks," omitting the adjectives "artificial" or "formal." In fact, an RNN is most often just a feature classifier that builds separating hyperplanes in feature space.

The formal neuron used in these networks is an adder with a threshold element: it computes the sum of the products of feature values by certain coefficients, which are nothing other than the coefficients of the equation of a separating hyperplane in feature space. If the sum is less than the threshold, the feature vector lies on one side of the separating plane; if it is greater, on the other. That is all. Beyond constructing separating hyperplanes and classifying by features, there are no miracles here.
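The formal neuron just described fits in a few lines; the weights and threshold below are illustrative values, not taken from the text, chosen to represent the hyperplane x1 + x2 = 1 in a 2-D feature space.

```python
def formal_neuron(x, w, threshold):
    """Weighted sum with a hard threshold: reports which side of the
    hyperplane w·x = threshold the feature vector x falls on."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= threshold else -1

# Hyperplane x1 + x2 = 1 (illustrative coefficients).
w, threshold = [1.0, 1.0], 1.0
print(formal_neuron([0.2, 0.3], w, threshold))  # → -1 (below the plane)
print(formal_neuron([0.8, 0.9], w, threshold))  # → 1  (above the plane)
```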

Replacing the threshold jump from -1 to 1 in a formal neuron with a smooth (differentiable), most often sigmoid, transition changes nothing fundamental. It merely allows gradient algorithms to be used for training the network, that is, for finding the coefficients in the equations of the separating planes, and it "smears" the boundary of the separating plane, assigning the recognition result (the output of a formal neuron near the boundary) a score, for example, in the range from 0 to 1. This score can, to some extent, reflect the system's "confidence" in assigning the input vector to one or another of the separated regions of feature space. Strictly speaking, however, this score is neither a probability nor a distance to the separating plane.
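The smoothing can be shown by swapping the hard threshold for a sigmoid, using the same illustrative hyperplane x1 + x2 = 1 as before. On the plane the output is exactly 0.5; far from it the output saturates toward 0 or 1, which is the "score" the text describes.

```python
import math

def sigmoid_neuron(x, w, b):
    """Weighted sum passed through a sigmoid instead of a hard
    threshold: output in (0, 1), equal to 0.5 exactly on the plane."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1.0 / (1.0 + math.exp(-s))

w, b = [1.0, 1.0], 1.0
print(round(sigmoid_neuron([0.5, 0.5], w, b), 3))  # → 0.5 (on the plane)
print(round(sigmoid_neuron([5.0, 5.0], w, b), 3))  # → 1.0 (far above it)
```

As the text stresses, this output is a monotone function of the signed sum w·x - b, not a calibrated probability and not a geometric distance.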

A network of formal neurons can also approximate nonlinear separating surfaces with planes and combine disconnected regions of feature space in its output. This is what multilayer networks do.
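A standard small illustration of combining half-planes in a hidden layer is the XOR function, which no single separating line can compute. The weights and thresholds below are one conventional hand-picked choice, not the result of training.

```python
def step(s):
    """Hard threshold at zero."""
    return 1 if s >= 0 else 0

def xor_net(x1, x2):
    """Two-layer network of threshold neurons computing XOR."""
    # Hidden layer: two separating lines.
    h1 = step(x1 + x2 - 0.5)    # fires if at least one input is on
    h2 = step(-x1 - x2 + 1.5)   # fires unless both inputs are on
    # Output neuron: intersection (AND) of the two half-planes.
    return step(h1 + h2 - 1.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
# → 0 0 -> 0 / 0 1 -> 1 / 1 0 -> 1 / 1 1 -> 0
```

The output neuron here carves out the region between the two lines, a region no single formal neuron could delimit, which is exactly the "approximation by planes" the text describes.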

In all cases, a feature-based recognition neural network is a feature classifier that builds separating hyperplanes and delimits regions in a fixed space of features (characteristics). It cannot solve any other problems, and it solves the recognition problem no better than conventional feature recognizers based on analytical methods.

Besides feature recognizers, raster recognizers, including ensemble ones, can also be built from formal neurons. In this case, all the noted disadvantages of raster recognizers remain, though there may be some advantages, which we will discuss later.

To avoid misunderstanding, note that it is in principle possible to build a universal computer from formal neurons, using both separating planes in the space of variables and the logical functions AND, OR, and NOT, which are easily implemented with formal neurons; but no one builds such computers, and a discussion of the related issues goes beyond the scope of the problems under consideration. What are usually called neurocomputers are either simply neural recognizers or special systems that solve problems close to pattern recognition and that actually perform recognition by constructing separating hyperplanes in feature space or by comparing a raster with a standard.
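The claim that AND, OR, and NOT are easily realized by formal neurons can be checked directly; the weights and thresholds below are one standard choice, not the only one.

```python
def neuron(inputs, weights, threshold):
    """A single formal neuron with 0/1 output."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# Each logic gate is one threshold neuron.
def AND(a, b):
    return neuron([a, b], [1, 1], 2)   # fires only if both inputs fire

def OR(a, b):
    return neuron([a, b], [1, 1], 1)   # fires if at least one input fires

def NOT(a):
    return neuron([a], [-1], 0)        # inverts its single input

print(AND(1, 1), OR(0, 1), NOT(0))  # → 1 1 1
print(AND(1, 0), OR(0, 0), NOT(1))  # → 0 0 0
```

Since {AND, OR, NOT} is functionally complete, any Boolean circuit, and hence in principle a universal computer, can be assembled from such neurons, which is the point the paragraph above makes before setting it aside.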

It was already noted above that for modeling thinking it is very important, and perhaps necessary, to understand how the neural mechanisms of the living brain work. In this regard, the question arises: are formal recognition neural networks, if not a solution to the problem of modeling the neural mechanisms of the brain, then at least an important step in this direction? Unfortunately, the answer must be no. Unlike an active living neural network, an RNN is a passive feature or raster classifier with all the disadvantages of traditional classifiers. We will consider the arguments behind this conclusion in more detail later.

So, traditional, primarily feature-based, recognition systems, built on the sequential recognition and classification of objects considered separately, cannot effectively solve the problem of perceiving complex visual information. The main reasons are the lack of integrity and purposefulness of perception, the lack of integrity in the descriptions (standards) of the recognized objects, and the sequential organization of the recognition process. For the same reasons, such pattern recognition systems shed little light on living visual perception and the thinking process.