
Definition of statistics and their levels. Naive Bayes classifier for indicator set signals

Cross-validation is a well-known method for obtaining estimates of unknown model parameters. The main idea of the method is to divide the data sample into v "folds": randomly selected, isolated subsamples.

Using a fixed value of k, a k-nearest neighbors model is built to obtain predictions for the v-th segment (the remaining segments are used as training examples), and the classification error is estimated. For regression problems, the sum of squared errors is most often used as the error estimate, and for classification problems it is more convenient to consider accuracy (the percentage of correctly classified observations).

The process is then repeated sequentially for all possible choices of v. When all v "folds" (cycles) are exhausted, the computed errors are averaged and used as a measure of the model's stability (i.e., of prediction quality at the query points). These steps are repeated for different k, and the value corresponding to the smallest error (or highest classification accuracy) is accepted as optimal (optimal in the sense of the cross-validation method).
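As a sketch of the loop just described (written in MQL style, consistent with the code later in this article): the helper KnnPredict(), which classifies one observation by its k nearest neighbors while ignoring the observation's own fold, is hypothetical and shown only to illustrate the procedure.

// Hedged sketch of v-fold cross-validation for selecting k.
// KnnPredict(k, i, excludedFold) is an assumed helper, not a real API:
// it classifies observation i using its k nearest neighbors among the
// observations that do not belong to fold excludedFold.
int ChooseBestK(const int maxK, const int &labels[], const int &foldOf[])
{
   int bestK = 1;
   double bestAccuracy = 0.0;
   const int total = ArraySize(labels);
   for(int k = 1; k <= maxK; k++)
   {
      int correct = 0;
      for(int i = 0; i < total; i++)
      {
         // the fold containing observation i is held out as the test segment
         if(KnnPredict(k, i, foldOf[i]) == labels[i]) correct++;
      }
      const double accuracy = (double)correct / total; // averaged over all folds
      if(accuracy > bestAccuracy)
      {
         bestAccuracy = accuracy;
         bestK = k;
      }
   }
   return bestK;
}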

Please note that cross-validation is a computationally intensive procedure, so you need to allow time for the algorithm to run, especially if the sample size is large.

The second option for choosing the value of the parameter k is to set it yourself. However, this method should only be used when there are reasonable guesses about the possible value of the parameter, such as previous studies of similar data sets.

The k-nearest neighbors method shows quite good results in a wide variety of problems.

A real-world use of this method is the Dell Technical Support Center software developed by Inference. This system helps center staff answer more requests by immediately offering answers to common questions and allowing them to access the database while on the phone with the user. Thanks to the implementation of this method, technical support employees can handle a significant number of calls simultaneously. The CBR software is currently deployed on the Dell intranet.

There are not many Data Mining tools that implement the k-nearest neighbors and CBR methods. Among the best known are CBR Express and Case Point (Inference Corp.), Apriori (Answer Systems), DP Umbrella (VYCOR Corp.), KATE tools (Acknosoft, France), Pattern Recognition Workbench (Unica, USA), as well as some statistical packages, for example Statistica.

Bayesian classification

Alternative names: Bayesian modeling, Bayesian statistics, Bayesian network method.

Bayesian classification was originally used to formalize expert knowledge in expert systems; it is now also used as one of the Data Mining methods.

The so-called naive classification, or naive Bayes approach, is the simplest version of the method using Bayesian networks. It solves classification problems, and the result of the method is so-called "transparent" models.

"Naive" classification is a fairly transparent and understandable classification method. It is called "naive" because it proceeds from the assumption of mutual independence of the features.

Properties of naive classification:

  1. Using all variables and defining all dependencies between them.
  2. Having two assumptions about the variables:
    • all variables are equally important;
    • all variables are statistically independent, i.e. the value of one variable says nothing about the value of another.

Most other classification methods assume that, before classification begins, an object is equally likely to belong to one class or another; but this is not always true.

Let's say we know that a certain percentage of the data belongs to a particular class. The question arises: can we use this information when building a classification model? There are many real-life examples of using such prior knowledge to help classify objects. A typical example comes from medical practice: when a doctor sends a patient's test results for additional testing, he thereby assigns the patient to a specific class. How can this information be applied? We can use it as additional data when building the classification model.

The following advantages of Bayesian networks as a Data Mining method are noted:

  • the model defines the dependencies between all variables, which makes it easy to handle situations in which the values of some variables are unknown;
  • Bayesian networks are quite simply interpreted and allow easy analysis of “what if” scenarios at the predictive modeling stage;
  • the Bayesian method allows you to naturally combine patterns inferred from data and, for example, expert knowledge obtained explicitly;
  • the use of Bayesian networks avoids the problem of overfitting, that is, excessive complication of the model, which is a weakness of many methods (for example, decision trees and neural networks).

The Naive Bayes approach has the following disadvantages:

  • It is correct to multiply conditional probabilities only when all input variables are truly statistically independent; although this method often shows quite good results when the condition of statistical independence is not met, theoretically this situation should be handled by more complex methods based on training Bayesian networks;
  • Direct processing of continuous variables is impossible - they need to be converted to an interval scale so that the attributes are discrete; however, such transformations can sometimes lead to the loss of significant patterns;
  • The classification result in the naive Bayes approach is influenced only by the individual values of the input variables; the combined influence of pairs or triples of values of different attributes is not taken into account. Doing so could improve the quality of the classification model in terms of predictive accuracy, but it would increase the number of options to be tested.

Bayesian classification has found wide application in practice.

Bayesian word filtering

Recently, Bayesian classification was proposed for personal spam filtering. The first filter was developed by Paul Graham. For the algorithm to work, two requirements must be met.

The first requirement is that the object being classified has a sufficient number of characteristics. This is ideally satisfied by all the words of a user's letters, except very short and very rare ones.

The second requirement is constant retraining and replenishment of the "spam / not spam" sets. Such conditions work very well in local email clients, since the flow of "not spam" from an end client is fairly constant, and if it changes, it changes slowly.

However, for all clients of a server it is quite difficult to accurately determine the "non-spam" flow, since a letter that is spam for one client is not spam for another. The dictionary becomes too large, there is no clear division into spam and "not spam", and as a result the quality of classification, i.e. of the solution to the letter-filtering problem, drops significantly.

In this part we will not talk about recommender systems as such. Instead, we will specifically focus on the main tool of machine learning - Bayes' theorem - and look at one simple example of its application - the naive Bayes classifier. Disclaimer: I’m unlikely to tell a reader familiar with the subject anything new here; we’ll talk mainly about the basic philosophy of machine learning.


Bayes' theorem is either remembered or can be trivially derived by anyone who has taken even the most basic course in probability theory. Remember what the conditional probability of an event x given an event y is? Right by definition: p(x|y) = p(x, y) / p(y), where p(x, y) is the joint probability of x and y, and p(x) and p(y) are the probabilities of each event separately. This means that the joint probability can be expressed in two ways:

p(x, y) = p(x|y) p(y) = p(y|x) p(x).

Well, here's Bayes' theorem:

p(y|x) = p(x|y) p(y) / p(x).

You probably think that I'm mocking you: how can a trivially tautological rewriting of the definition of conditional probability be the main tool of anything, especially of such a large and non-trivial science as machine learning? However, let's start to work it out; first, let's just rewrite Bayes' theorem in different notation (yes, yes, I continue to mock):

p(θ|D) = p(D|θ) p(θ) / p(D).

Now let's relate this to a typical machine learning problem. Here D is the data, what we know, and θ is the model parameters we want to train. For example, in the SVD model, the data are the ratings that users gave to products, and the parameters of the model are the factors that we train for users and products.

Each of the probabilities also has its own meaning. p(θ|D) is what we want to find: the probability distribution of the model parameters after we have taken the data into account; it is called the posterior probability. This probability, as a rule, cannot be found directly, and this is where Bayes' theorem is needed. p(D|θ) is the so-called likelihood: the probability of the data given fixed model parameters; this is usually easy to find, and in fact the construction of a model usually amounts to specifying the likelihood function. And p(θ) is the prior probability: a mathematical formalization of our intuition about the subject, of what we knew before running any experiments.

This is probably not the time or place to go deeper, but the merit of the Reverend Thomas Bayes was, of course, not in rewriting the definition of conditional probability in two lines (there were no such definitions back then), but precisely in putting forward and developing this view of the very concept of probability. Today the "Bayesian approach" means treating probabilities as "degrees of belief" rather than the frequentist (from the word frequency, not freak!) "proportion of successful experiments as the total number of experiments tends to infinity". In particular, this allows us to talk about the probabilities of one-off events: after all, there is no "number of experiments tending to infinity" for events like "Russia will become the world champion in football in 2018" or, closer to our topic, "you will like the movie Tractor Drivers"; it's like the joke about the dinosaur: either you will like it or you won't. The mathematics, of course, is the same everywhere; Kolmogorov's axioms of probability don't care what people think about them.

To consolidate what has been covered, here is a simple example. Consider the task of text categorization: suppose we are trying to sort a news stream by topic based on an existing database of topics: sports, economics, culture... We will use the so-called bag-of-words model: a document is represented as the (multi)set of words it contains. As a result, each test case x is described by the attributes (w1, w2, ..., wn) and must be assigned a value v from the set of categories V. We need to find the most likely value of this attribute, i.e.

v = argmax over v in V of p(v | w1, ..., wn).

According to Bayes' theorem,

p(v | w1, ..., wn) = p(w1, ..., wn | v) p(v) / p(w1, ..., wn),

and since the denominator does not depend on v, it can be dropped when maximizing.

It is easy to estimate p(v): we simply estimate the frequency of each category's occurrence. But the various p(w1, ..., wn | v) cannot be estimated: there are far too many of them, since this is the probability of exactly this set of words appearing in messages on different topics. Obviously, there is nowhere to get such statistics.

To deal with this, the naive Bayes classifier (sometimes even called idiot's Bayes) assumes conditional independence of the attributes given the value of the objective function:

p(w1, ..., wn | v) = p(w1 | v) p(w2 | v) ... p(wn | v).

Now it is much easier to train the individual p(wi | v): it is enough to count the statistics of word occurrences in each category (there is one more detail that leads to two different versions of naive Bayes, but we will not go into that now).
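For illustration, here is a minimal sketch of such a classifier in the MQL style used later in this article. All names and sizes here are illustrative assumptions, not code from the article; word counts per category are presumed to be collected already, logarithms are summed to avoid numeric underflow, and Laplace smoothing (+1) keeps unseen words from zeroing the product.

#define NCAT   3     // assumed number of categories
#define NWORDS 1000  // assumed dictionary size

int wordCount[NCAT][NWORDS]; // occurrences of word w in category c (assumed filled)
int docCount[NCAT];          // number of documents per category (assumed filled)
int totalDocs = 1;           // overall number of documents (assumed filled)

// Return the most likely category for a document given as word indices:
// argmax over c of log p(c) + sum_i log p(w_i | c).
int NaiveBayesClassify(const int &words[])
{
   int best = 0;
   double bestScore = -DBL_MAX;
   for(int c = 0; c < NCAT; c++)
   {
      int totalWords = 0;
      for(int w = 0; w < NWORDS; w++) totalWords += wordCount[c][w];
      double score = MathLog((double)docCount[c] / totalDocs); // prior p(c)
      for(int i = 0; i < ArraySize(words); i++)
      {
         // Laplace-smoothed estimate of p(w_i | c)
         score += MathLog((wordCount[c][words[i]] + 1.0) / (totalWords + NWORDS));
      }
      if(score > bestScore) { bestScore = score; best = c; }
   }
   return best;
}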

Note that the naive Bayes classifier makes a pretty damn strong assumption: in text classification, we assume that different words in a text on the same topic appear independently of each other. This is, of course, complete nonsense, but the results are nevertheless quite decent. In fact, the naive Bayes classifier is much better than it seems. Its probability estimates are optimal, of course, only in the case of true independence; but the classifier itself is optimal in a much wider class of problems, and here is why. First, the attributes are, of course, dependent, but their dependence is often the same across different classes and "mutually cancels" when estimating probabilities: the grammatical and semantic dependencies between words are the same in a text about football and in a text about Bayesian learning. Second, naive Bayes is very bad at estimating probabilities, but much better as a classifier (typically, even if the true probabilities of two classes are, say, close to 0.51 and 0.49, naive Bayes will produce estimates close to 1 and 0, yet the classification itself will more often be correct).

In the next series, we will complicate this example and consider an LDA model that is capable of identifying topics in a corpus of documents without any set of tagged documents, so that one document can contain several topics, and also apply it to the recommendation problem.

Whether we like it or not, statistics play a significant role in trading. From fundamental news full of numbers to trading reports or testing reports, there is no escaping statistical indicators. At the same time, the thesis about the applicability of statistics to making trading decisions remains one of the most controversial topics. Is the market random, are quotes stationary, is a probabilistic approach applicable to their analysis? One can argue about this endlessly. On the Internet, and even on this site, it is easy to find materials and discussions with a wide variety of viewpoints, rigorous scientific calculations and impressive graphs. However, traders are usually interested in the applied aspect: how it all works in practice, in the trading terminal. This article is an attempt to demonstrate a pragmatic approach to a probabilistic trading decision model using a set of technical indicators. Minimum theory, maximum practice.

The idea is to evaluate the potential of various indicators from the perspective of probability theory and test the ability of a committee of indicators to increase the winning percentage of a trading system.

This will require the creation of a framework for processing signals from arbitrary indicators and a simple expert based on it for testing.

It is planned to use standard indicators as working indicators, but the framework will allow you to independently connect and analyze other, custom indicators.

But before we start designing and implementing algorithms, we still have to dive a little into the theory.

Introduction to the Conditional Probability Model

The title of the article mentions the Naive Bayes classifier. It is based on the well-known Bayes formula, briefly discussed here, and is called "naive" because of the required assumption that the random variables described by the formula are independent. We will talk about the independence of indicators later; for now, here is the formula itself:

P(H|E) = P(E|H) * P(H) / P(E)   (1)

where H is a certain hypothesis about the internal state of the system (in our case, a hypothesis about the state of the market and the trading system), E is an observed event (in our case, indicator signals), and the probabilities describing them are:

  • P(H) is the a priori probability of state H, known from the history of observations;
  • P(E) - the total probability of event E over all existing hypotheses, of which there are usually several (note that the hypotheses must be mutually exclusive, i.e. the system is in exactly one state at each moment; references are available for those who want to delve deeper into the theory);
  • P(E|H) - probability of occurrence of event E if hypothesis (state) H is true;
  • P(H|E) - posterior probability of hypothesis (state) H when observing event E.

If we take the simplest trading system as an example, market states such as upward movement (buying), downward movement (selling) and sideways fluctuation (waiting) are usually considered as the hypotheses H. Indicator signals are used as the events E describing the probable state of the market.

For signals of a specific indicator, it is easy to calculate the probabilities from the right side of formula (1) on the available history and then find out the most probable state of the market P(H|E).

However, for the calculation it is necessary to more clearly define the hypotheses and the methodology for collecting statistics, on the basis of which the probabilities will be obtained.

First of all, let's assume that trading is carried out by bars (not by ticks). Trading performance can be assessed by profit, profit factor or other characteristics, but for simplicity of presentation we will take the number of winning and losing market entries. This directly links the assessment of the system to the probability of successful trades (triggered signals).

We will also limit ourselves to a trading system without take profit and stop loss levels, without trailing the position with a stop loss, and without changing the lot. All these parameters can be introduced into the model, but they would significantly complicate the probability calculations by turning them into multivariate distributions. The only parameter of the trading system will be the duration of holding a position, in bars. In other words, after entering the market in the direction selected using the indicators, the exit is made automatically after a predetermined time. This approach is good because it puts the emphasis on the correctness or falsity of the hypothesis about the growth or fall of the quote. Thus, we test the hypothesis in its pure form, without lifebuoys or straw laid underneath.

To complete the theme of simplifications, we will make two more radical moves.

It was said above that "buy", "sell" and "wait" are usually taken as the trading hypotheses. By discarding "wait", we noticeably reduce the calculations almost without loss of generality. It may seem that such simplifications will negatively affect the applicability of the result, and to some extent this is true. However, if you notice how much material is still left to read even with these simplifications, you will probably agree that it is better to first get a working model, to which details can be added later, gradually. Those wishing to build more complex models that take probability densities into account can find relevant works on the Internet, including in English, such as Reasoning Methods for Merging Financial Technical Indicators, which describes a hybrid probabilistic decision-making system.

Finally, the second and last radical move is to combine the "buy" and "sell" states into one with the universal meaning "entry into the market". We usually use multidirectional indicator signals symmetrically, in a similar way: for example, overbought according to the indicator becomes a signal to sell, and oversold a signal to buy.

In other words, hypothesis H now sounds like a successful entry into the market in either of two directions (buy or sell).

Under these conditions, the calculation of probabilities from the right side of formula (1) can be performed on the selected history of quotes as follows.

P(H) = 1, because on any bar there is an opportunity to successfully enter the market: one of the two directions will be profitable (we neglect the spread here, because the working timeframe will be D1, as discussed in more detail below).

P(E) = number of bars with indicator signals / total number of bars

P(E|H) = number of bars with indicator signals that coincided with a profitable trading direction / total number of bars

After simplification, we obtain a formula for calculating the history of the probability that the signal of the selected indicator indicates the conditions for opening a successful transaction:

P(H|E) = Nok / Ntotal   (2)

where Nok is the number of correct signals, Ntotal is the total number of signals.
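In code this is a one-line computation. A small sketch (the function name is illustrative), with a check against the statistics collected later in the article, where the stochastic strategy gives 18 + 14 correct signals out of 30 + 22:

// Formula (2): empirical probability that an indicator signal marks
// the conditions for a successful entry.
double SignalProbability(const int Nok, const int Ntotal)
{
   return (Ntotal > 0) ? (double)Nok / Ntotal : 0.0;
}

// Example with the stochastic statistics reported below:
// SignalProbability(18 + 14, 30 + 22) = 32.0 / 52 = 0.615...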

A little later we will implement a framework that allows this probability to be calculated for any indicator. As we will see, it is usually close to 0.5, and some research is needed to find conditions under which it is consistently greater than 0.5. Indicators with a large value of this probability are rare. For the standard indicators, which we will study first, it ranges from 0.51 to 0.55. Clearly, such values are too small: they are more likely to let you break even than to steadily grow a deposit.

To solve this problem, it is necessary to use not one indicator but several. This solution is not new in itself; it is used by most traders. But probability theory makes it possible to quantitatively analyze the effectiveness of indicators in various combinations and to estimate the potential effect.

Formula (1) for the case of three indicators (A, B, C) will look like this:

P(H|ABC) = P(ABC|H) * P(H) / P(ABC)   (3)

We need to bring it to a form convenient for algorithmic calculation. Fortunately, Bayesian theory is used in many industries, and therefore there is a ready-made recipe for our case.

In particular, there is such a direction as Bayesian spam filtering. We don't have to understand it thoroughly. Only the fundamental concepts are important. A document (for example, an email message) is marked as spam if it contains certain characteristic words. The general occurrence of words in a language and the probabilities of finding them in spam are known, just as we know the general probabilities of indicator signals and the percentage of their “hits on the mark.” In other words, it is enough to replace the “spam” hypothesis with a “successful transaction”, and the “word” event with an “indicator signal”, so that the theory of spam processing fits completely into our theory of probabilistic trading.

Then formula (3) can be expanded through the probabilities of the individual indicators as follows (for the derivation, see the link above):

P(H|ABC) = P(H|A) * P(H|B) * P(H|C) / (P(H|A) * P(H|B) * P(H|C) + (1 - P(H|A)) * (1 - P(H|B)) * (1 - P(H|C)))   (4)

Calculations of P(H|A), P(H|B), P(H|C) are performed according to formula (2) for each indicator separately.

Of course, if necessary, formula (4) can easily be extended to any number of indicators. To get a rough idea of how the number of indicators affects the probability of a correct trading decision, let's assume that all indicators have the same probability value:

P(H|A) = P(H|B) = P(H|C) = p

Then formula (4) will take the form:

P(N) = p^N / (p^N + (1 - p)^N)   (5)

where N is the number of indicators.

A graph of this function for various values of N is shown in Figure 1.


Fig. 1. Joint probability for different numbers of random variables

So, with p = 0.51 we get P(3) = 0.53, which is not particularly impressive, but with p = 0.55 - P(3) = 0.65, and this is already a noticeable improvement.
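Here is a sketch of this combination rule as a reusable function; the name CombinedProbability is illustrative, but the arithmetic is exactly formulas (4) and (5), and the figures from the paragraph above can be reproduced with it.

// Combine individual probabilities P(H|Xi) the way Bayesian spam filters
// combine per-word probabilities (formula (4) generalized to N indicators).
double CombinedProbability(const double &p[])
{
   double num = 1.0, den = 1.0;
   for(int i = 0; i < ArraySize(p); i++)
   {
      num *= p[i];       // product of P(H|Xi)
      den *= 1.0 - p[i]; // product of (1 - P(H|Xi))
   }
   return num / (num + den);
}

// Checks against the figures above:
// double p1[3] = {0.51, 0.51, 0.51}; CombinedProbability(p1) -> 0.530
// double p2[3] = {0.55, 0.55, 0.55}; CombinedProbability(p2) -> 0.646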

Independence of indicators

The formulas discussed above are based on the assumption of independence of the analyzed random processes, which in our case are indicator signals. But is this condition met?

It is obvious that some indicators, including many from the list of standard ones, have a lot in common. As a visual illustration, Figure 2 shows some of the built-in indicators.

Fig. 2. Groups of similar standard indicators

It is easy to notice that the Stochastic and WPR indicators for the same period, superimposed on each other in the last window, actually repeat each other. This is not surprising since their formulas are arithmetically equivalent.

The MACD and Awesome Oscillator indicators shown just above in the screenshot are identical, adjusted for the type of moving averages. In addition, since both are built on moving averages (MA), they cannot be called independent of the MAs themselves.

RSI, RVI, CCI are also highly correlated. It should be noted that almost all standard oscillators are similar; the correlation coefficients will be close to 1.

There is also notable overlap among volatility indicators, particularly ATR and StdDev.

All this is important to take into account when creating a set of indicators for a trading system, since the real effect of a committee of dependent indicators will in practice be much lower than the expected theoretical one.

By the way, a similar situation arises when training neural networks. With their help, traders often try to process data from many voluntarily chosen indicators. However, feeding dependent vectors to the input of networks significantly reduces the efficiency of training, since the computing power of the network is wasted. The volume of analyzed data may seem large, but the information contained in it is duplicated and meaningless.

A rigorous approach to this problem requires calculating the correlation between indicators and compiling sets with the smallest pairwise values. This is a separate large area of ​​research. Those interested can find articles on this topic on the Internet. Here we will be guided by general considerations based on the observations stated above. For example, one of the sets may look like this: Stochastic, ATR, AC (Acceleration/Deceleration) or WPR, Bollinger Bands, Momentum.

It should be clarified here that the Acceleration/Deceleration (AC) indicator is essentially a derivative of an oscillator. Why is it suitable for inclusion in the group?

Let's represent a series of quotes (or an oscillator derived from them) in simplified form as a periodic fluctuation, for example a cosine or sine. Recall that the derivatives of these functions are, respectively:

(sin x)' = cos x,  (cos x)' = -sin x   (6)

The correlations between these functions and their derivatives are zero.
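This is easy to verify numerically; a minimal sketch sampling one full period (the function name is illustrative):

// Pearson correlation of sin(x) with its derivative cos(x) over one period.
// Over a full period both means are zero, so the formula simplifies.
double SinCosCorrelation(const int samples = 1000)
{
   double sxy = 0, sxx = 0, syy = 0;
   for(int i = 0; i < samples; i++)
   {
      const double x = 2 * M_PI * i / samples;
      const double s = MathSin(x), c = MathCos(x);
      sxy += s * c;
      sxx += s * s;
      syy += c * c;
   }
   return sxy / MathSqrt(sxx * syy); // close to 0
}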

Therefore, the first derivative of an indicator is generally a good candidate for consideration as an additional independent indicator.

The second derivative is a dubious candidate in such oscillatory processes, because the chances of obtaining a replica of the original signal are high.

Concluding the conversation about the independence of indicators, it makes sense to dwell on the question of whether copies of the indicator calculated with different periods can be considered independent.

It can be assumed that the answer depends on the ratio of the periods. A small difference obviously preserves the dependence between the copies, so a noticeable difference is required. This is partly consistent with classical methods such as Elder's triple screen, where the timeframes used differ, as a rule, by at least a factor of 5, which is equivalent to analyzing indicators with correspondingly different periods.

It should be noted that in the system under consideration, the independent quantities should not actually be the readings of the indicators, but the trading signals generated by them. However, for most indicators of the same type, for example, oscillators, the principles of generating trading signals are similar, and therefore a strong or weak dependence of time series is equivalent to a strong or weak dependence of signals.

Design

So, we have understood the theory and are ready to tackle the question of what and how to code.

We will collect statistics of indicator trading signals in a special expert. For the expert to be able to trade on the readings of arbitrary indicators, we will need to develop a framework (in fact, an mqh header file) that receives, through input parameters, a description of the indicators used and of how to generate signals from them. For example, we should be able to set two moving averages of different periods in the parameters and generate buy and sell signals when the fast MA crosses the slower one upward or downward, respectively.

The Expert Advisor will have explicit control over the opening of bars and will trade only at opening prices. This is not a real expert, but a tool for calculating probabilities and testing hypotheses. It is important for us that the verification takes place quickly, because there are an unlimited number of options for sets of indicators.

D1 will be used as the default working timeframe. Of course, nothing prevents further analysis on any other timeframe, but D1 is the least susceptible to random noise, and the analysis of patterns that have persisted for several years best fits the specifics of the probabilistic approach. In addition, for trading strategies on D1 the spread can usually be neglected, which compensates for our refusal to support the intermediate "wait" state of the system. For intraday trading, of course, such an assumption could not be made, and the probabilities of a larger number of hypotheses would have to be calculated.

As mentioned earlier, the expert will open positions based on indicator signals and close them after a predetermined period of time. To do this, we introduce the corresponding input parameter. Its default value will be 5 days. This is the characteristic period for the D1 time frame and is used in many trading research papers that also use D1.

The EA and the framework will be cross-platform, that is, they will compile and run in both MetaTrader 4 and MetaTrader 5. This is achieved through existing, publicly available wrapper header files that allow the MetaTrader 4 MQL API syntax to be used in the MetaTrader 5 environment; in addition, in some places we will use conditional compilation: platform-specific parts of the code will be wrapped in the preprocessor directives #ifdef __MQL4__ and #ifdef __MQL5__.
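A schematic fragment of this kind (an illustration, not a quote from the expert) shows the idea:

// Conditional compilation: each terminal sees only its own branch
#ifdef __MQL5__
   MqlTick tick;
   SymbolInfoTick(_Symbol, tick);   // MQL5-specific API
   double bid = tick.bid;
#else // __MQL4__
   double bid = Bid;                // MQL4 predefined variable
#endif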

Implementation in MQL

Framework for indicators

We will begin our review of the framework for processing indicator signals by discussing what types of indicators we will need. The most obvious listing includes all built-in indicators, as well as an item for custom indicators iCustom. An enumeration will be needed to select indicators using the framework's input parameters.

enum IndicatorType
{
   iCustom,
   iAC,
   iAD,
   tADX_period_price,
   tAlligator_jawP_jawS_teethP_teethS_lipsP_lipsS_method_price,
   iAO,
   iATR_period,
   tBands_period_deviation_shift_price,
   iBearsPower_period_price,
   iBullsPower_period_price,
   iBWMFI,
   iCCI_period_price,
   iDeMarker_period,
   tEnvelopes_period_method_shift_price_deviation,
   iForce_period_method_price,
   dFractals,
   dGator_jawP_jawS_teethP_teethS_lipsP_lipsS_method_price,
   fIchimoku_tenkan_kijun_senkou,
   iMomentum_period_price,
   iMFI_period,
   iMA_period_shift_method_price,
   dMACD_fast_slow_signal_price,
   iOBV_price,
   iOsMA_fast_slow_signal_price,
   iRSI_period_price,
   dRVI_period,
   iSAR_step_maximum,
   iStdDev_period_shift_method_price,
   dStochastic_K_D_slowing_method_price,
   iWPR_period
};

The name of each built-in indicator contains a suffix with information about the indicator's own parameters. The first character of the element indicates the number of available buffers: for example, i means one buffer, d two, t three. All of these are just hints for the user. If the user specifies a wrong number of parameters or the index of a non-existent buffer, the framework will report an error in the log.

Of course, in the input parameters you will need to specify for each indicator not only its type, but also the actual parameters in the form of a string, the buffer number and the number of the bar from which the data will be read.

Based on the indicator readings, it is necessary to generate signals. In principle, there can be many different ones, but we will bring together the main options in another listing.

enum SignalCondition
{
   Disabled,
   NotEmptyIndicatorX,
   SignOfValueIndicatorX,
   IndicatorXcrossesIndicatorY,
   IndicatorXcrossesLevelX,
   IndicatorXrelatesToIndicatorY,
   IndicatorXrelatesToLevelX
};

Thus, signals can be generated:

  • if the indicator value is not empty;
  • the indicator value has the required sign (positive or negative);
  • the indicator crosses another indicator, and here it should be noted that when describing the signal, we must provide the opportunity to specify 2 indicators;
  • the indicator crosses a certain level, and here it becomes clear that there should be a field for entering the level;
  • the indicator is positioned as required relative to another indicator (for example, above or below);
  • the indicator is positioned in the required manner relative to the specified level;

The first element - Disabled - allows you to disable any signal generation condition. We will provide several identical groups of input parameters to describe the signals, and each signal will be disabled by default.

From the names of the items in the previous listing, it can be assumed that it is necessary to somehow set the required sign of the values ​​and the position of the lines relative to each other. For this purpose, let's add one more enumeration.

enum UpZeroDown
{
   EqualOrNone,
   UpSideOrAboveOrPositve,
   DownSideOrBelowOrNegative,
   NotEqual
};

EqualOrNone allows you to check:

  • empty value in combination with SignOfValueIndicatorX
  • level equality in combination with IndicatorXrelatesToLevelX

UpSideOrAboveOrPositve allows you to check:

  • bottom-up crossing using IndicatorXcrossesIndicatorY
  • positive value using SignOfValueIndicatorX
  • crossing a level from bottom to top using IndicatorXcrossesLevelX
  • growth of indicator values ​​on successive bars using IndicatorXrelatesToIndicatorY, if X and Y are the same indicator
  • placing X over Y using IndicatorXrelatesToIndicatorY if X and Y are different indicators
  • positioning the indicator above the level using IndicatorXrelatesToLevelX

DownSideOrBelowOrNegative allows you to check:

  • top-down crossing using IndicatorXcrossesIndicatorY
    • negative value using SignOfValueIndicatorX
  • crossing a level from top to bottom using IndicatorXcrossesLevelX
  • dropping indicator values ​​on consecutive bars using IndicatorXrelatesToIndicatorY if X and Y are the same indicator
  • placing X under Y using IndicatorXrelatesToIndicatorY if X and Y are different indicators
  • positioning the indicator below the level using IndicatorXrelatesToLevelX

NotEqual allows you to check:

  • inequality to level (value) using IndicatorXrelatesToLevelX

When the signal is triggered, it needs to be processed. To do this, we will define a special enumeration.

enum SignalType
{
   Alert,
   Buy,
   Sell,
   CloseBuy,
   CloseSell,
   CloseAll,
   BuyAndCloseSell,
   SellAndCloseBuy,
   ModifyStopLoss,
   ModifyTakeProfit,
   ProceedToNextCondition
};

The main signal-processing actions are listed here: displaying a message, buying, selling, closing open orders (buy, sell, or both), reversing from sell to buy, reversing from buy to sell, modifying the stop loss or take profit levels, and proceeding to check the next condition (signal). The last item allows you to build a chain of signal checks (for example, check whether the main buffer has crossed the signal line, and if so, further check whether this happened above or below a certain level).
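For example, a chain of two signals in the style of the settings groups shown below might look like this (illustrative values, not a set used in the tests):

SIGNAL_A = chain start
Condition = IndicatorXcrossesIndicatorY
Action = ProceedToNextCondition

SIGNAL_B = chain end
Condition = IndicatorXrelatesToLevelX
Level X = 0
Action = Buy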

You may notice that there is no setting of pending orders in the list of actions. This is left outside the scope of this work. Those interested can expand the framework.

With all these enumerations in hand, we can describe several groups of attributes with which the working indicators are specified. One group looks like this:

input IndicatorType Indicator1Selector = iCustom; // · Selector
input string Indicator1Name = ""; // · Name
input string Parameter1List = "" /*1.0,value:t,value:t*/; // · Parameters
input string Indicator1Buffer = ""; // · Buffer
input int Indicator1Bar = 1; // · Bar

The Indicator1Name parameter is intended to specify the name of a custom indicator when iCustom is set in Indicator1Selector.

The Parameter1List parameter allows you to specify indicator parameters as a string, separated by commas. The type of each input parameter will be recognized automatically, for example, 11.0 - double, 11 - int, 2015.01.01 20:00 - date/time, true/false - bool, "text" - string. Some parameters - for example, moving average types or price types - can be specified not as a number, but as a string without quotes (sma, ema, smma, lwma, close, open, high, low, median, typical, weighted, lowhigh, closeclose).

Indicator1Buffer - number or name of the buffer without quotes. Supported buffer names are main, signal, upper, lower, jaw, teeth, lips, tenkan, kijun, senkouA, senkouB, chikou, +di, -di.

Indicator1Bar - bar number, default is 1.
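For instance, a filled group describing a 14-period RSI on close prices might look like this (an illustrative example, not one of the sets used below):

Selector = iRSI_period_price
Name =
Parameters = 14,close
Buffer = 0
Bar = 1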

Once the indicators have been determined, signals can be generated based on them, i.e. conditions for triggering events. Each signal is specified by a group of input parameters.

input string __SIGNAL_A = "";
input SignalCondition ConditionA = Disabled; // · Condition A
input string IndicatorA1 = ""; // · Indicator X for signal A
input string IndicatorA2 = ""; // · Indicator Y for signal A
input double LevelA1 = 0; // · Level X for signal A
input double LevelA2 = 0; // · Level Y for signal A
input UpZeroDown DirectionA = EqualOrNone; // · Direction or sign A
input SignalType ExecutionA = Alert; // Action A

For each signal, you can specify an identifier in the __SIGNAL_ parameter.

Using Condition, you select the condition for checking the signal. Next, one or two indicators and one or two level values ​​are set (the second level is reserved for the future and will not be used in this experiment). Indicators in the Indicator parameters are either the indicator number from the corresponding attribute group, or the indicator prototype in the form:

indicatorName@buffer(param1,param2,...)

This form of recording allows you to quickly determine the indicator used without describing it in detail using a group of attributes. For example,

iMA@0(1,0,sma,high)

gives high price values, and on each current bar of a working expert, bar number 1 is taken (the most recently completed one, for which the high price is finally known).

Thus, indicators can be specified both in selected attribute groups (for subsequent reference to them from signals by number) and directly in signals in the Indicator parameter (X or Y). The first method is convenient if the same indicator needs to be used in different signals or as X and Y within one signal.

The Direction parameter specifies the direction or sign of the value for the condition to fire. In accordance with Execution, one or another action is performed when the signal is activated.

Inside the framework, it is now determined that an indicator cannot have more than 20 parameters, the maximum number of selected groups with indicator attributes is 6 (but, as we remember, indicators can additionally be set directly in the signal), and signals - 8. All this can be changed in the source code. The IndicatN.mqh file is attached at the end of the article.

Also in this file, in the form of several classes, all the logic for parsing indicator parameters, calling them, checking conditions and returning the verification results to the calling code (which will be our expert) is implemented.

In particular, to convey instructions about the need to perform any action from the SignalType enumeration discussed above, a simple public class TradeSignals is used with logical fields corresponding to the enumeration items:

class TradeSignals
{
   public:
      bool alert;
      bool buy;
      bool sell;
      bool buyExit;
      bool sellExit;
      bool ModifySL;
      bool ModifyTP;
      int index;
      double value;
      string message;
      TradeSignals(): alert(false), buy(false), sell(false), buyExit(false), sellExit(false), ModifySL(false), ModifyTP(false), value(EMPTY_VALUE), message("") {}
};

When the necessary conditions are met, the fields are set to true. For example, if the CloseAll action is selected, the buyExit and sellExit flags are set in the TradeSignals object.

The index field contains the serial number of the triggered condition.

Using the value field, you can pass an arbitrary value, for example, a new stop loss level obtained from the indicator values.

Finally, the message field contains a message to the user describing the situation.

Details of the implementation of all classes can be found in the source code. It uses the auxiliary header files fmtprnt2.mqh (formatted log output) and RubbArray.mqh (rubber array), which are also included.

The framework header file IndicatN.mqh should be included in the expert code using the #include directive. As a result, after compilation in the expert settings dialog, we will receive groups of input parameters with indicator attributes:

Fig.3 Setting up indicators

and with signal definitions:

Fig.4 Setting up trading signals

The screenshots show properties already filled in. We will look at these in detail when we move on to the concept of the Expert Advisor and begin setting up specific trading strategies. It is worth noting here that when setting indicator attributes, you can replace any numeric parameter with an expression like =var1, =var2, and so on up to =var9. These refer to special framework input parameters with the same names (var1, var2, etc.) intended for optimization. For example, the entry:

iMACD@main(=var4,=var5,=var6,open)

means that the fast, slow and signal moving average periods of MACD can be optimized through the input parameters var4, var5 and var6 respectively. Even with optimization disabled, during a single test the values of the corresponding indicator attributes will be read from these framework input parameters.

Test Expert

To make coding easier, we will move all trading functions into a special class and design it as a separate header file Expert0.mqh. Since we are going to test fairly simple trading systems, the class will only allow opening and closing positions.

Thus, all routine operations with indicators and related to trading are placed in header files.

#include <IndicatN.mqh>
#include <Expert0.mqh>

Directly in the indstats.mq4 expert file there will be very few lines of code and simple logic.

Since the Expert Advisor must compile and work in MetaTrader 5 after changing the extension to mq5, we will add header files that ensure the transfer of codes to the new environment.

#ifdef __MQL5__
#include // (the three wrapper header names are elided in the original)
#include
#include
#endif

Now let's turn to the input parameters of the Expert Advisor.

input int ConsistentSignalNumber = 1;
input int Magic = 0;
input float Lot = 0.01f;
input int TradeDuration = 1;

Magic and Lot are required to create an Expert object from the Expert0.mqh file.

Expert e(Magic, Lot);

The ConsistentSignalNumber parameter will contain the number of trading signals that we are trying to combine to improve reliability.

The TradeDuration parameter specifies the number of bars during which the open position will be held. As mentioned above, we will open trades based on signals and exit them after 5 bars, i.e. days, since the D1 timeframe is used.

In the OnInit event handler, we will initialize the indicator framework.

int OnInit()
{
   return IndicatN::handleInit();
}

In the OnTick handler we will provide control over the opening of the bar.

void OnTick()
{
   static datetime lastBar;
   if(lastBar != Time[0])
   {
      const RubbArray<TradeSignals> *ts = IndicatN::handleStart();
      ...
      lastBar = Time[0];
   }
}

When forming a new bar, we will check all the indicators and the conditions associated with them by calling the indicator framework again. As a result, we get an array of triggered signals - TradeSignals objects.

Now it's time to talk about the accumulation of statistics.

Each condition (event) of the framework, if it occurs, generates a signal with the alert flag by default. We will use this to count the number of signals from indicators, as well as the number of realized system states, i.e. cases (bars) when buying or selling would be successful.

To calculate statistics, we will describe arrays.

int bars = 0;                              // total count of bars/samples
int bull = 0, bear = 0;                    // number of bars/samples per trade type
int buy[] = {0}, sell[] = {0};             // unconditional signals arrays
int buyOnBull[] = {0}, sellOnBear[] = {0}; // conditional (successful) signals arrays

In our case of bar trading, each bar is a potential new entry into a trade lasting 5 bars. Each such segment is characterized by a rise or fall in quotes and is marked as bullish or bearish, respectively.

All buy and sell signals are summed in the buy and sell arrays, and if a signal coincides with the "bullishness" or "bearishness" of the segment, that is, it is successful, it is also accumulated in the buyOnBull or sellOnBear array, depending on its type.

To fill the arrays, let's write the following code inside OnTick.

const RubbArray<TradeSignals> *ts = IndicatN::handleStart();
bool up = false, down = false;
int buySignalCount = 0, sellSignalCount = 0;
for(int i = 0; i < ts.size(); i++)
{
   // alerts are used to collect statistics
   if(ts[i].alert)
   {
      // while setting up events, enumerated by i,
      // hypotheses H_xxx should come first, before signals S_xxx,
      // because we assign up or down marks here
      if(IndicatN::GetSignal(ts[i].index) == "H_BULL")
      {
         bull++;
         buy[ts[i].index]++;
         up = true;
      }
      else if(IndicatN::GetSignal(ts[i].index) == "H_BEAR")
      {
         bear++;
         sell[ts[i].index]++;
         down = true;
      }
      else if(StringFind(IndicatN::GetSignal(ts[i].index), "S_BUY") == 0)
      {
         buy[ts[i].index]++;
         if(up)
         {
            if(PrintDetails) Print("buyOk ", IndicatN::GetSignal(ts[i].index));
            buyOnBull[ts[i].index]++;
         }
      }
      else if(StringFind(IndicatN::GetSignal(ts[i].index), "S_SELL") == 0)
      {
         sell[ts[i].index]++;
         if(down)
         {
            if(PrintDetails) Print("sellOk ", IndicatN::GetSignal(ts[i].index));
            sellOnBear[ts[i].index]++;
         }
      }
      if(PrintDetails) Print(ts[i].message);
   }
}

Having received an array of triggered signals, we go through its elements in a loop. If the alert flag is set, this is statistics collection.

Before analyzing the code more deeply, let's introduce a special convention for naming signals (events). Hypotheses about bullish or bearish market conditions will be marked with the identifiers H_BULL and H_BEAR. These events should be determined using the input parameters of the framework first - before other events (indicator signals). This is necessary in order to establish the corresponding signs - logical variables up and down - based on confirmed hypotheses.

Indicator signals must have identifiers starting with S_BUY or S_SELL.

As you can see, using the number of the triggered event, ts[i].index, we obtain its identifier through a call to the GetSignal function. When the hypotheses are realized, we update the overall counters of bullish or bearish segments. When signals are generated, we keep a total count for each type of signal, as well as a success count, that is, the number of matches with the current hypotheses.

Recall that either the H_BULL hypothesis or the H_BEAR hypothesis is true on every bar.

In addition to collecting statistics, the expert must support trading based on signals. For this purpose, we will supplement the body of the loop with a check for the buy and sell flags.

if(ts[i].buy)
{
   buySignalCount++;
}
else if(ts[i].sell)
{
   sellSignalCount++;
}

After the loop, we implement the trading functionality. First of all, we close open positions (if any) after the specified period.

if(e.getLastOrderBar() >= TradeDuration)
{
   e.closeMarketOrders();
}
if(buySignalCount >= ConsistentSignalNumber && sellSignalCount >= ConsistentSignalNumber)
{
   Print("Signal collision");
}
else if(buySignalCount >= ConsistentSignalNumber)
{
   e.closeMarketOrders(e.mask(OP_SELL));
   if(e.getOrderCount(e.mask(OP_BUY)) == 0)
   {
      e.placeMarketOrder(OP_BUY);
   }
}
else if(sellSignalCount >= ConsistentSignalNumber)
{
   e.closeMarketOrders(e.mask(OP_BUY));
   if(e.getOrderCount(e.mask(OP_SELL)) == 0)
   {
      e.placeMarketOrder(OP_SELL);
   }
}

If the signals to buy and sell contradict each other, we skip this condition. If the number of buy or sell signals is equal to or greater than the predefined ConsistentSignalNumber, open the corresponding order.

It should be noted that by setting ConsistentSignalNumber to a lower value than the number of configured signals, it will be possible to test trading the system in the mode of combining all or most strategies. In normal operation, the expert will use intersection rather than union, since to find joint events, ConsistentSignalNumber must be exactly equal to the number of signals. For example, with 3 signals configured and ConsistentSignalNumber equal to 3, trading will be carried out only when all three events occur simultaneously. If ConsistentSignalNumber is set to 1, then trades will be opened upon receipt of any (at least one) of the 3 signals.

In the OnDeinit handler, we will display the collected statistics on alerts or order history in the log.

The full source code of the Expert Advisor can be viewed in the indstats.mq4 file.

All signals must be tested against two hypotheses of buying or selling. To do this, we will configure the H_BULL and H_BEAR signals, as well as their indicators.

To obtain bar prices, use the iMA indicator with a period of 1. In the __INDICATOR_1 group, set:

Selector = iMA_period_shift_method_price

Parameters = 1,0,sma,open

In the __INDICATOR_2 group we will make similar settings except for the bar number - there you should enter 5, the number of bars that we will use in the TradeDuration parameter.

In other words, in statistics collection mode the expert does not trade; it analyzes the change in quotes between bar 5 and bar 0, as well as the indicator signals on bar 5 or 6, depending on the type of price used: for indicators working on open prices, values can be taken from bar 5, and for all others from bar 6. In this mode, bar 5 is the virtual current bar, and all subsequent bars provide information about the "future" realization of the bullish or bearish market hypotheses.

Let’s immediately make a reservation that in the trading mode we will take signals from bar 0 (if the indicator is built at the open price) or bar 1 (in other cases). If the expert did not work based on opening prices and analyzed ticks, it would be necessary to look at the indicator values ​​at bar 0 in this mode.
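To summarize the bar numbering in the two modes:

Mode                   Open-price indicators   Other indicators
Statistics collection  bar 5                   bar 6
Trading                bar 0                   bar 1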

The presence of these two modes - collecting statistics and trading - implies the need to create different sets of parameters that differ in the numbers of working bars. We will start with a set for collecting statistics, and then easily convert it into a real trading one.

Using these two copies of the MA indicator, we will set up hypotheses. In the __SIGNAL_A group we enter:

SIGNAL_A = H_BULL

Condition = IndicatorXrelatesToIndicatorY
Indicator X = 1
Indicator Y = 2
Direction or sign = UpSideOrAboveOrPositve
Action = Alert

Let's configure the __SIGNAL_B group in the same way, except for the direction:

SIGNAL_B = H_BEAR
Direction or sign = DownSideOrBelowOrNegative

To test the probabilistic trading model, we will use 3 standard strategies based on indicators:

  • Stochastic
  • MACD
  • BollingerBands

Let us note in advance that the parameters of all indicators have been optimized, some of which were specifically left in the form of links to input variables var1, var2, etc. to demonstrate this framework capability. To replicate positive results with your provider's data, each strategy will likely need to be re-optimized.

The Stochastic strategy consists of buying when the indicator crosses level 20 from bottom to top and selling when it crosses level 80 from top to bottom. To do this, we define the group __INDICATOR_3:

Selector = dStochastic_K_D_slowing_method_price

Parameters = 14,3,3,sma,lowhigh
Buffer = main
Bar = 6

Since the indicator uses high and low prices, it is necessary to take bar number 6 - the last one fully formed before bar 5, where virtual trading begins if the signal is triggered.

Based on the Stochastic indicator, we will set up buy and sell signals. Group for purchase:

SIGNAL_C = S_BUY stochastic

Condition = IndicatorXcrossesLevelX
Level X = 20
Direction or sign = UpSideOrAboveOrPositve

Group for sale:

SIGNAL_D = S_SELL stochastic

Condition = IndicatorXcrossesLevelX
Level X = 80
Direction or sign = DownSideOrBelowOrNegative

The MACD strategy consists of buying when the main line crosses the signal line upward and selling when it crosses downward.

Let's configure the indicator group __INDICATOR_4:

Selector = dMACD_fast_slow_signal_price

Parameters = =var4,=var5,=var6,open
Buffer = signal
Bar = 5

We read the fast, slow and signal periods from the parameters var4, var5, var6, which are available for optimization. They are currently 6, 21 and 6 respectively. We use bar 5 because the indicator is built on the open price.

Since the number of groups for setting up indicators is limited, we will describe the main buffer directly in the signals. Group for purchase:

SIGNAL_E = S_BUY macd

Condition = IndicatorXcrossesIndicatorY
Indicator X = iMACD@main(=var4,=var5,=var6,open)
Indicator Y = 4
Direction or sign = UpSideOrAboveOrPositve

Group for sale:

SIGNAL_F = S_SELL macd

Condition = IndicatorXcrossesIndicatorY
Indicator X = iMACD@main(=var4,=var5,=var6,open)
Indicator Y = 4
Direction or sign = DownSideOrBelowOrNegative

The BollingerBands-based strategy consists of buying when the high of the previous bar breaks through the upper line of the indicator shifted 2 bars to the right, and selling when the low of the previous bar breaks through the lower line shifted 2 bars to the right. Below are the settings of the two indicator lines.

Selector = tBands_period_deviation_shift_price
Parameters = =var1,=var2,2,typical
Buffer = upper
Bar = 5

Selector = tBands_period_deviation_shift_price
Parameters = =var1,=var2,2,typical
Buffer = lower
Bar = 5

The period and deviation are specified in var1 and var2 as 7 and 1 respectively. In both cases bar 5 can be used, despite the typical price type, because the indicator lines are shifted 2 bars to the right, i.e. they are actually calculated on past data.

Finally, the signal settings groups look like this.

SIGNAL_G = S_BUY bands

Condition = IndicatorXcrossesIndicatorY Indicator X = iMA@0(1,0,sma,high) Indicator Y = 5 Direction or sign = UpSideOrAboveOrPositve

SIGNAL_H = S_SELL bands

Condition = IndicatorXcrossesIndicatorY Indicator X = iMA@0(1,0,sma,low) Indicator Y = 6 Direction or sign = DownSideOrBelowOrNegative

All settings in the form of set files are attached at the end of the article.

Results

Statistics on indicators

To calculate the probabilities, we use statistics for the period 2014.01.01-2017.01.01 for the EURUSD D1 pair. Expert settings for the statistics collection mode are contained in the indstats-stats-all.set file.

The collected data is displayed in the log. Below is an example:

: bars=778
: bull=328 bear=449
: buy: 328 0 30 0 50 0 58 0
: buyOk: 0 0 18 0 29 0 30 0
: sell: 0 449 0 22 0 49 0 67
: sellOk: 0 0 0 14 0 28 0 41
: totals: 0.00 0.00 0.60 0.64 0.58 0.57 0.52 0.61
: Stats by name:
: macd=0.576
: bands=0.568
: stochastic=0.615

The total number of bars is 778, of which 328 were suitable for a successful 5-day buy trade and 449 for a successful 5-day sell trade. The first 2 columns contain the hypothesis counters (the same 2 numbers), and the following pairs of columns refer to the corresponding trading strategies, each represented by a buy column and a sell column. For example, the stochastic strategy generated 30 buy signals, 18 of which were profitable, and 22 sell signals, 14 of which were profitable. Summing the successful signals for each strategy and dividing by the number of generated signals gives the effectiveness (probability of success on historical data) of each: for the stochastic, (18 + 14) / (30 + 22) = 32 / 52 ≈ 0.615.

  • Stochastic - 0.615
  • MACD - 0.576
  • Bands - 0.568
Test trading

To make sure that the statistics are calculated correctly, you need to run the Expert Advisor in trading mode. To do this, you need to edit the bar numbers in the settings, replacing 5 with 0, 6 with 1. In addition, you should sequentially enable trading strategies one after another by setting the Action parameters to Buy and Sell instead of Alert. For example, to test stochastic trading, in the __SIGNAL_C (S_BUY stochastic) group, replace the Alert value in the Action parameter with the Buy value, and in the __SIGNAL_D (S_SELL stochastic) group, replace the Alert value with Sell.

The corresponding settings for all 3 strategies are given, respectively, in the files indstats-trade-stoch.set, indstats-trade-macd.set, indstats-trade-bands.set.

By running the expert 3 times with these sets of parameters, we will get 3 logs with brief trading reports. Statistics are at the very end. For example, for stochastic we get the line:

: Buys: 18/29 0.62 Sells: 14/22 0.64 Total: 0.63

These are the figures for actual trades: 18 of 29 buys were profitable and 14 of 22 sells were profitable; the overall signal efficiency is 0.63.

The results of the MACD and BollingerBands strategies are shown below.

: Buys: 29/49 0.59 Sells: 28/49 0.57 Total: 0.58

: Buys: 29/51 0.57 Sells: 34/59 0.58 Totals: 0.57

Let's gather the efficiency figures of all strategies into one list.

  • Stochastic - 0.63
  • MACD - 0.58
  • Bands - 0.57

Here we see almost complete agreement with the theory from the previous subsection. Part of the difference comes from the fact that trading signals can overlap within 5 bars, in which case a repeat trade is not opened.

Of course, we can analyze the trading reports for each strategy.


Fig.5 Strategy report based on the Stochastic indicator


Fig.6 Strategy report based on the MACD indicator


Fig.7 Strategy report based on the BollingerBands indicator

Using formula (4), we calculate the theoretical probability that a trade entered on synchronous signals of all three indicators will be successful:

P(H|ABC) = 0.63 * 0.58 * 0.57 / (0.63 * 0.58 * 0.57 + 0.37 * 0.42 * 0.43) = 0.208278 / (0.208278 + 0.066822) = 0.208278 / 0.2751 = 0.757
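The same computation generalizes to any number of strategies. A minimal sketch of formula (4), assuming independent signals and the equal priors used throughout this article:

```python
import math

def combined_probability(ps):
    """P(H | all signals fire) for independent signals with equal priors:
    prod(p) / (prod(p) + prod(1 - p)), i.e. formula (4)."""
    success = math.prod(ps)
    failure = math.prod(1 - p for p in ps)
    return success / (success + failure)

print(round(combined_probability([0.63, 0.58, 0.57]), 3))  # 0.757
```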

To test this situation, we must enable all three signals, and also change the value of the ConsistentSignalNumber parameter from 1 to 3. The corresponding settings are in the indstats-trade-all.set file.

According to the trading test in the tester, the final efficiency of such a system in practice is 0.75:

: Buys: 4/7 0.57 Sells: 5/5 1.00 Total: 0.75

Here is the test report:


Fig.8 Report on a combination of strategies based on 3 indicators

Below is a table of trading metrics for each indicator separately and for their superposition.


Strategy          Profit, $   Profit factor (PF)   Trades (N)   Max drawdown (DD), $
Stochastic        204         2.36                 51           41
MACD              159         1.39                 98           76
Bands             132         1.29                 110          64
Combined (all 3)  68          3.18                 12           30

As we can see, the increase in the probability of winning is achieved through rarer but more accurate entries: the number of trades and the total profit decreased, while the profit factor and the maximum drawdown improved by at least 35%, and in some cases more than twofold.

Conclusion

The article discusses the simplest implementation of a probabilistic approach to making trading decisions based on indicator signals. Using a dedicated Expert Advisor, it was shown that theoretical calculations of the increased probability of successful trades via the Bayes formula correspond to the results obtained in practice.

Since signal generation is discrete, the signals of different indicators may not coincide, and a situation is potentially possible in which the superposition of indicators gives no common signals confirmed by all of them. One possible solution to this problem is to introduce a time tolerance between signals.

More generally, one can estimate the probability density of trading hypotheses as a function of the state (rather than the signals) of indicators. For example, the overbought or oversold zone defined by a specific oscillator value yields a percentage (probability) of successful entries. In addition, the probability of a successful trade obviously depends on the chosen stop-loss and take-profit parameters, on lot-management methods and many other system parameters. All of this can be analyzed from the standpoint of probability theory and used for more accurate, though more complex, calculation of trading decisions.

The files are attached below:

  • indstats.mq4 (aka indstats.mq5) - expert;
  • common-includes.zip - archive with common mqh header files used;
  • additional-mt5-includes.zip - archive with additional header files for MT5;
  • instats-tester-sets.zip - archive with set settings files;

The most common application of Naive Bayes is document classification. Is this email spam or legitimate mail? Is this Twitter post benign or angry? Should this intercepted cell-phone call be turned over to federal agents for further investigation? You provide "training data", such as pre-classified example documents, to the learning algorithm, which can then categorize new documents into the same categories using the acquired knowledge.

The most common approach to document classification is the bag-of-words model combined with a naive Bayes classifier. The bag-of-words model treats a document as an unordered jumble of words: to it, "Johnny ate the cheese" is the same as "the cheese ate Johnny", since both consist of the same set of words ("Johnny", "ate", "cheese").
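A two-line illustration of this order-blindness (Python's Counter stands in for the bag):

```python
from collections import Counter

# word order is discarded; only the multiset of tokens remains
print(Counter('johnny ate the cheese'.split()) ==
      Counter('the cheese ate johnny'.split()))   # True
```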


A short introduction to probability theory. The expression p() denotes probability. For example, p(A) = 0.2 means that event A occurs with a probability of 20%. Expressions like p(A|B) denote conditional probabilities: p(A|B) = 0.3 means that the probability of event A, given that event B has occurred, is 30%. The joint probability p(A, B) denotes the probability that events A and B occur simultaneously. If A and B are independent, then p(A, B) = p(A) * p(B). If A and B are dependent, then p(A, B) = p(A) * p(B|A).

As an example, we study tweets about Mandrill.com, an email-sending service. A search by the keyword mandrill returns, in addition to useful links, ones that are not relevant. Our job is to filter out the relevant tweets. Suppose we previously accumulated a database of 300 tweets: 150 about the Mandrill.com application and 150 about other things.

We break each tweet down into individual words (called tokens). We are interested in two probabilities:

p(app | word1, word2, …)
p(other | word1, word2, …)

These are the probabilities that the tweet is about the application or about something else, given that it contains the words "word1", "word2", etc. If

(1) p(app | word1, word2, …) > p(other | word1, word2, …)

then the tweet is about Mandrill.com. But how do we calculate these probabilities? The first step is to apply Bayes' theorem, which allows the conditional probability for the application to be rewritten as:

(2) p(app | word1, word2, …) = p(app) * p(word1, word2, … | app) / p(word1, word2, …)

and similarly for the other tweets:

(3) p(other | word1, word2, …) = p(other) * p(word1, word2, … | other) / p(word1, word2, …)

Substituting (2) and (3) into (1) and multiplying both sides by p(word1, word2, ...), we obtain condition (1) in the form:

(4) p(app) * p(word1, word2, … | app) > p(other) * p(word1, word2, … | other)

The maximum a posteriori (MAP) rule used for the analysis allows us, first, to ignore the difference between p(app) and p(other) (both classes contain 150 tweets here, so the priors are equal) and, second, to treat the probabilities of words appearing in a tweet as independent (although in reality they are not!) and to replace:

p(word1, word2, … | app) -> p(word1 | app) * p(word2 | app) * …
p(word1, word2, … | other) -> p(word1 | other) * p(word2 | other) * …

In the final form, we will compare two quantities:

(5) p(word1 | app) * p(word2 | app) * … > p(word1 | other) * p(word2 | other) * …

The independence assumption allows us to split the joint conditional probability of a set of words, given the class, into a product of the probabilities of finding each word in that class. Treating the words as independent introduces many errors into the MAP estimate, but in the end they do not affect the correct choice between tweets about the application and the rest.

Two problems remain: what to do with rare words, and how to cope with the vanishingly small values that arise when a large number of probabilities close to zero are multiplied together? The first is handled by adding one to every count (even zero); this is called additive (add-one, or Laplace) smoothing and is routinely used to accommodate previously unseen words in a bag-of-words model. The second is handled by adding logarithms instead of multiplying: for example, for the product 0.2 * 0.8, ln(0.2 * 0.8) = ln(0.2) + ln(0.8).
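Putting the pieces together, here is a minimal sketch of the whole decision rule (5) with add-one smoothing and log probabilities. The toy corpora and tokenization are illustrative; the Excel workbook does the equivalent with pivot tables and formulas:

```python
import math
from collections import Counter

def train(docs):
    """Token counts and total token number for one class corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    return counts, sum(counts.values())

def log_score(tokens, counts, total, vocab):
    # add-one (Laplace) smoothing: even an unseen word gets a nonzero
    # probability; logarithms turn the tiny product into a safe sum
    return sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)

# hypothetical tokenized tweets: about the app vs. everything else
app_docs = [['mandrill', 'email', 'api'], ['mandrill', 'smtp', 'email']]
oth_docs = [['mandrill', 'monkey', 'zoo'], ['zoo', 'monkey', 'banana']]
app_counts, app_total = train(app_docs)
oth_counts, oth_total = train(oth_docs)
vocab = len(set(app_counts) | set(oth_counts))

tweet = ['mandrill', 'email']
is_app = (log_score(tweet, app_counts, app_total, vocab) >
          log_score(tweet, oth_counts, oth_total, vocab))
print('app' if is_app else 'other')   # app
```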

So, all the explanations are given, and we can move on to Excel. The first two sheets of the example workbook contain, respectively, 150 tweets related to the Mandrill.com app (Fig. 1) and 150 tweets on other topics. In the original tweet text, all letters are first converted to lowercase, and then punctuation marks are replaced with spaces. For example, the formula in cell E2, =SUBSTITUTE(D2;"?";" "), replaces every question mark in the text of cell D2 with a space.

Fig. 1. Removing unnecessary characters in the database of tweets about the application

Now we need to count how many times each word occurs in the posts of a given category. To do this, collect all the words from the tweets of each database in one column. Assuming that each tweet contains no more than 30 words, and intending to give each token its own row, you will need 150 * 30 = 4500 rows. Create a new sheet and name it Tokens_app. Put the header Tweets in cell A1. Copy the values H2:H151 from the Application sheet to the clipboard, select the range A2:A4501 on the Tokens_app sheet and click Paste -> Paste Special -> Values (Fig. 2). Click OK. Note that since you are pasting 150 tweets into 4500 rows, Excel repeats them for you. This means that if the first word of the first tweet is extracted in row 2, the same tweet is repeated so that its second word can be extracted in row 152, the third in row 302, and so on.

Study the formulas in columns B:D of the Tokens_app sheet to understand the mechanics of sequentially extracting tokens from a tweet (Fig. 3). Create a Tokens_other sheet in the same way for the database of tweets not related to the Mandrill.com application.

Fig. 3. Fragment of the Tokens_app sheet, which extracts tokens from the database of tweets related to the Mandrill.com app

Now, based on the Tokens_app sheet, create a pivot table that counts the number of occurrences of each token. Using the pivot-table filter, exclude words up to 4 characters long, and add columns that calculate the logarithm of each token's frequency of occurrence (Fig. 4). Repeat the operation for the Tokens_other sheet.
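The token counting that the pivot tables perform can be mirrored in a few lines of Python. This sketch repeats the same pipeline (lowercasing, punctuation to spaces, dropping short words); the example tweet is hypothetical:

```python
import re
from collections import Counter

def tokenize(tweet):
    # the same cleaning the SUBSTITUTE formulas perform:
    # lowercase, then replace every non-word character with a space
    return re.sub(r'[^\w]+', ' ', tweet.lower()).split()

def token_table(tweets, min_len=5):
    """Frequency table of tokens, skipping words of up to 4 characters,
    as the pivot-table filter does."""
    return Counter(tok for t in tweets for tok in tokenize(t)
                   if len(tok) >= min_len)

print(token_table(['Loving the @mandrillapp API, great email delivery!']))
# Counter({'loving': 1, 'mandrillapp': 1, 'great': 1, 'email': 1, 'delivery': 1})
```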

Now that the classifier model is "trained", it is time to use it. The Test sheet holds 20 tweets to be classified, already processed in the same way (as in Fig. 1). Place the prepared tweets on the Classification sheet. Select D2:D21 and choose Data -> Text to Columns. In the window that appears, select Delimited and click Next. In the second step, choose tab and space as the delimiter characters, tick Treat consecutive delimiters as one (Fig. 5), and set the text qualifier to (none). Click Next. In the last step, set the Column data format to General. Click Finish.

The procedure scatters the tweets across the columns of the sheet, up to column AI (Fig. 6).

Now, using the VLOOKUP function, we extract the logarithms of the probabilities of the test tokens occurring in the two data sets (application / other). We compare the sums and conclude which class each tweet belongs to (Fig. 7). Tweets for which the difference of the log sums is less than 1 are highlighted in color. The formulas can be studied in more detail on the Classification sheet.

That's all: the model is built and the predictions are made.

Based on the book by John Foreman, Data Smart (Russian edition: Moscow, Alpina Publisher, 2016, pp. 101-128).

Statistics is a science that studies the quantitative side of mass socio-economic phenomena and processes in inextricable unity with their qualitative side, under specific conditions of place and time.

In the natural sciences, the concept of “statistics” means the analysis of mass phenomena based on the application of methods of probability theory.

Statistics develops a special methodology for the research and processing of material: mass statistical observation, the method of groupings, averages, indices, the balance method, and graphical methods.

Its methodological features are the study of the mass character of phenomena and of qualitatively homogeneous characteristics of a given phenomenon in dynamics.

Statistics includes a number of sections, among them: the general theory of statistics, economic statistics, and sectoral statistics (industrial, agricultural, transport, medical).

11. Groups of indicators for assessing the health status of the population.

Population health is characterized by three groups of main indicators:

A) medical and demographic – reflect the state and dynamics of demographic processes:

    Population statistics (density, distribution, social composition, gender and age composition, literacy, education, nationality, language, culture.)

    Population dynamics (mechanical movement - emigration and immigration; natural movement - fertility, mortality, natural increase.)

    Marital status (marriage rate, divorce rate, average length of marriage.)

    Reproduction processes (total fertility rate, gross reproduction rate and net reproduction rate.)

    Average life expectancy

    Mortality (structure of mortality, mortality rates depending on the cause, nature of morbidity and age.)

B) indicators of morbidity and injury (primary morbidity, prevalence, cumulative morbidity, pathological incidence, health index, mortality, injuries, disability.)

C) indicators of physical development:

    Anthropometric (height, body weight, circumference of the chest, head, shoulder, forearm, lower leg, thigh)

    Physiometric (vital capacity of the lungs, muscle strength of the hands, back muscle strength)

    Somatoscopic (physique, muscle development, degree of fatness, shape of the chest, shape of the legs, feet, severity of secondary sexual characteristics.)

Medical statistics, its sections, tasks. The role of the statistical method in the study of population health and the performance of the health care system.

Medical (sanitary) statistics studies the quantitative side of phenomena and processes related to medicine, hygiene and healthcare.

There are 3 sections of medical statistics:

1. Population health statistics studies the health status of the population as a whole or of its individual groups (by collecting and statistically analyzing data on the size and composition of the population, its reproduction, natural movement, physical development, the prevalence of various diseases, life expectancy, etc.). Health indicators are assessed in comparison with generally accepted reference levels and with levels obtained for different regions and over time.

2. Healthcare statistics addresses the collection, processing and analysis of information about the network of healthcare institutions (their location, equipment, activities) and personnel (the number of doctors, nursing and junior medical personnel, their distribution by specialty and length of service, their retraining, etc.). When analyzing the activities of treatment-and-prevention institutions, the data obtained are compared with standard levels and with levels obtained in other regions and over time.

3. Clinical statistics is the use of statistical methods in processing the results of clinical, experimental and laboratory studies; it allows one to assess, quantitatively, the reliability of research results and to solve a number of other problems (determining the required number of observations in a sample study, forming experimental and control groups, studying correlation and regression relationships, eliminating the qualitative heterogeneity of groups, etc.).

The objectives of medical statistics are:

1) study of the health status of the population, analysis of quantitative characteristics of public health.

2) identifying connections between health indicators and various factors of the natural and social environment, assessing the influence of these factors on the health levels of the population.

3) study of the material and technical base of healthcare.

4) analysis of the activities of medical institutions.

5) assessment of the effectiveness (medical, social, economic) of ongoing therapeutic, preventive, anti-epidemic measures and healthcare in general.

6) the use of statistical methods when conducting clinical and experimental biomedical research.

Medical statistics is a method of social diagnosis, since it makes it possible to assess the health status of a country's or region's population and, on this basis, to develop measures aimed at improving public health. The most important principle of statistics is its application to the study not of separate, isolated phenomena, but of mass phenomena, in order to reveal their general patterns. These patterns manifest themselves, as a rule, in a mass of observations, that is, when a statistical population is studied.

In medicine, statistics is the leading method, because:

1) allows you to quantitatively measure public health indicators and performance indicators of medical institutions

2) determines the strength of influence of various factors on public health

3) determines the effectiveness of treatment and recreational activities

4) allows you to assess the dynamics of health indicators and allows you to predict them

5) allows you to obtain the necessary data for the development of health care norms and standards.

Statistical population. Definition, types, properties. Features of the study of a statistical population.

The object of any statistical study is a statistical population.

Statistical population - a group consisting of many relatively homogeneous elements, taken together within certain boundaries of space and time and possessing signs of similarity and difference.

Properties of a statistical population: 1) homogeneity of the observation units; 2) certain boundaries of space and time of the phenomenon being studied.

The object of statistical research in medicine and healthcare can be various populations (the population as a whole or its individual groups, sick, dead, born), medical institutions, etc.

There are two types of statistical population:

a) general population

b) sample population


Sample population, requirements for it. Principles and methods of forming a sample population.

There are two types of statistical population:

a) general population - a set consisting of all observation units that can be attributed to it in accordance with the purpose of the study. When studying public health, the general population is often considered within specific territorial boundaries or may be limited by other characteristics (gender, age, etc.), depending on the purpose of the study.

b) sample population - a part of the general population, selected by a special (sampling) method and intended to characterize the general population.

Features of conducting statistical research on a sample population:

1. The sample population is formed in such a way as to provide an equal opportunity for all elements of the original population to be covered by observation.

2. the sample population must be representative, accurately and completely reflect the phenomenon, i.e. give the same idea of ​​the phenomenon as if the entire population were studied.


Requirements for the sample population:

1) must be representative, accurately and completely reflect the phenomenon, i.e. give the same idea of ​​the phenomenon as if the entire population were studied, for this it must:

A. be sufficient in number

b. have the main features of the general population (all elements must be represented in the selected part in the same proportion as in the general population)

2) when forming it, the basic principle of forming a sample population must be observed: an equal opportunity for every observation unit to be included in the study.

Methods for forming a sample population (the first two are sketched in code after this list):

1) random selection - selection of observation units by drawing lots using a table of random numbers, etc. In this case, each unit is provided with an equal opportunity to be included in the sample.

2) mechanical selection - units of the general population, sequentially arranged according to some characteristic (alphabetically, by date of visit to the doctor, etc.), are divided into equal parts; From each part, in a predetermined order, every 5, 10 or nth observation unit is selected in such a way as to ensure the required sample size.

3) typical (typological) selection - involves the mandatory preliminary division of the general population into separate qualitatively homogeneous groups (types) with the subsequent selection of observation units from each group according to the principles of random or mechanical selection.

4) serial (cluster, cluster) selection - involves sampling from the general population not of individual units, but of entire series (an organized collection of observation units, for example, organizations, districts, etc.)

5) combined methods - a combination of different methods of forming a sample.
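The first two selection methods are simple enough to express in code. A hedged Python sketch (the population of 1000 record IDs and the 10% sample size are arbitrary):

```python
import random

population = list(range(1, 1001))       # e.g. 1000 patient record IDs

# 1) random selection: every unit has an equal chance of inclusion
random_sample = random.sample(population, 100)

# 2) mechanical (systematic) selection: order the units by some
#    characteristic, then take every n-th one from a random start
n = len(population) // 100
start = random.randrange(n)
mechanical_sample = population[start::n]

print(len(random_sample), len(mechanical_sample))   # 100 100
```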