StatHEP - Statistical Data Analysis for HEP
The first version of the application with full grid functionality is ready.
A number of short test productions have been successfully executed on the BalticGrid infrastructure.
First preliminary physics results concerning the expected sensitivity of the CP violation
measurement in the LHCb experiment have been obtained recently.
The application.
The goal of the application is to provide a generic environment for running
toy Monte Carlo jobs. It is primarily meant to study the sensitivity and systematic biases
of the CP violation measurements but it can be equally well used for solving
other statistical problems.
The application consists of three main parts:
- Framework part - RooFit and ROOT.
- User part - user ROOT macro (C++) and the script to generate input data files.
- GRID part - submitting jobs and retrieval of results.
The framework part is responsible for providing the ROOT environment on a given
worker node (WN). When the job arrives at the WN, the existence of a ROOT
installation is checked; if none is detected, the ROOT source code is fetched
and compiled, and the installation tree is built in the common area of the
given virtual organisation (VO). The name of the top directory encodes the
processor architecture and the compiler version, so that a site with an
inhomogeneous structure is supported.
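The directory naming described above can be sketched as follows. This is only an illustration: the function name installDirName and the "root_<architecture>_gcc<version>" layout are assumptions, not the naming scheme actually used by the application.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds the name of the top directory of a ROOT
// installation from the processor architecture and the compiler version.
// The "root_<arch>_gcc<version>" layout is an assumption for illustration.
std::string installDirName(const std::string& arch,
                           const std::string& compilerVersion) {
    assert(!arch.empty() && !compilerVersion.empty());
    std::string name = "root_" + arch + "_gcc" + compilerVersion;
    // Replace characters that are awkward in directory names.
    for (char& c : name) {
        if (c == '.' || c == ' ' || c == '/') c = '_';
    }
    return name;
}
```

Worker nodes with different architectures or compilers then map to different directories, so each can find (or build) a ROOT installation that matches it.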
Next comes the execution of the user macro.
The main part of the user code is the ROOT macro, which is written in C++. The macro can be
executed in interpreter mode, or it can be compiled and dynamically linked
to the ROOT executable. The latter mode is recommended as it is much
faster.
The GRID functionality
is provided by scripts. The main steps are to generate the
input data files (parameters to run the user macro), to prepare the
GRID job and to submit it. Each job produces some results in the form of text
files or ROOT histograms/ntuples. In the final step all results are
retrieved and analysed to obtain the final summary result.
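The first of the steps above, generating one set of input parameters per job, might look like the sketch below. The function name makeJobInputs and the "job=... seed=... events=..." line layout are hypothetical; the real input files are produced by the user's own script and their format is not described here.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical layout of one input data file: a single text line holding the
// job index, the random number seed and the number of events to generate.
std::vector<std::string> makeJobInputs(int nJobs, unsigned baseSeed, int nEvents) {
    assert(nJobs > 0 && nEvents > 0);
    std::vector<std::string> inputs;
    for (int job = 0; job < nJobs; ++job) {
        std::ostringstream line;
        // Each job gets its own seed, so it uses a different random sequence.
        line << "job=" << job
             << " seed=" << (baseSeed + job)
             << " events=" << nEvents;
        inputs.push_back(line.str());
    }
    return inputs;
}
```

Each returned line would be written to its own file and shipped with the corresponding GRID job.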
A graphical user interface is planned for the future; it will probably be based on the Migrating
Desktop.
It is worth noting that the application can be used outside HEP. The
only elements that need to be tailored or replaced by a user are:
a ROOT macro, a script to generate input
data and a script to collect and analyse results.
Toy MC technique in a nutshell.
Physics results are often complicated functions of observables. A common feature of analyses is that
the measurement is a result of a complicated procedure, often including a simultaneous
fit to several distributions, each depending on a number of unknown parameters.
Analytical evaluation of the errors of such measurements is difficult or impossible.
The Toy Monte Carlo technique is commonly used in such cases.
A simplified model of the measurement is built and a large
number of identical jobs is launched, each with a different random number seed.
The width of the distribution of the measured values (a typical distribution
is shown below) is taken as the estimate of the measurement uncertainty.
In more complicated cases the measurement might depend on a number of parameters.
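The procedure can be illustrated with a minimal sketch. It uses only the standard C++ <random> library rather than ROOT or RooFit, and the "measurement" is an assumed toy example (estimating a lifetime from exponentially distributed decay times); the names runPseudoExperiment, runToyStudy and ToySummary are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// One pseudo-experiment: generate nEvents decay times from an exponential
// distribution with the given true lifetime and "measure" the lifetime as the
// sample mean (the maximum-likelihood estimate for an exponential).
double runPseudoExperiment(unsigned seed, int nEvents, double trueLifetime) {
    assert(nEvents > 0 && trueLifetime > 0.0);
    std::mt19937 rng(seed);
    std::exponential_distribution<double> decayTime(1.0 / trueLifetime);
    double sum = 0.0;
    for (int i = 0; i < nEvents; ++i) sum += decayTime(rng);
    return sum / nEvents;
}

struct ToySummary {
    double mean;   // average of the measured values
    double width;  // standard deviation of the measured values
};

// Run nJobs identical pseudo-experiments, each with a different seed, and
// summarise the distribution of the measured values.
ToySummary runToyStudy(int nJobs, int nEvents, double trueLifetime,
                       unsigned baseSeed) {
    assert(nJobs > 1);
    std::vector<double> measured;
    for (int job = 0; job < nJobs; ++job) {
        measured.push_back(
            runPseudoExperiment(baseSeed + job, nEvents, trueLifetime));
    }
    double mean = 0.0;
    for (double m : measured) mean += m;
    mean /= nJobs;
    double variance = 0.0;
    for (double m : measured) variance += (m - mean) * (m - mean);
    variance /= (nJobs - 1);
    return {mean, std::sqrt(variance)};
}
```

In this toy case the width returned by runToyStudy should be close to trueLifetime / sqrt(nEvents), which is the known analytical uncertainty of the sample mean. For a realistic multi-parameter fit no such formula is available, and the width of the toy distribution is the estimate.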

The dedicated package
that provides tools for building a Toy MC model is called RooFit. It was developed for
the BaBar experiment (see the RooFit
at BaBar pages and the Introduction
slides) within the ROOT framework.
RooFit contains many advanced utilities, such as conditional PDFs.
The Toy Monte Carlo approach is very often used in HEP experiments.
HEP detectors are complex apparatus and the phenomena studied are very subtle.
The required precise understanding of the whole chain of data acquisition
and analysis is provided by the full simulation program.
The full simulation is a very CPU-intensive task, so only MC samples of limited
statistics can be produced, and these are insufficient to study the precision of the measurements.
The Toy Monte Carlo technique is commonly used in such cases. The main idea
is to prepare a simplified model of the measurement employing
probability density functions, acceptance functions etc. derived from
the full simulation program, to submit a
large number of jobs which execute the procedure with different
initial parameters on a GRID, and to analyse the distributions of outcomes.
A typical measurement of combined CP symmetry violation belongs to
this category of problems. The final result depends on many other parameters;
some of them determine the PDF (Probability Density Function) or a variable
on which the results depend.
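How a probability density function and an acceptance function combine in a simplified model can be sketched as below. The shape of the acceptance function and its parameter are assumptions made purely for illustration (in a real study they would be derived from the full simulation), and the code uses the standard C++ <random> library, not RooFit.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical acceptance function: the probability that a decay with proper
// time t is kept. It rises from 0 at t = 0 towards 1 at large t.
double acceptance(double t) {
    const double a = 2.0;  // assumed turn-on parameter, illustration only
    const double x = a * t;
    return (x * x * x) / (1.0 + x * x * x);
}

// Generate nWanted decay times that follow "exponential PDF x acceptance"
// using the accept-reject method: draw t from the exponential PDF and keep it
// with probability acceptance(t).
std::vector<double> generateAcceptedTimes(int nWanted, double lifetime,
                                          unsigned seed) {
    assert(nWanted > 0 && lifetime > 0.0);
    std::mt19937 rng(seed);
    std::exponential_distribution<double> decayTime(1.0 / lifetime);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> accepted;
    while (static_cast<int>(accepted.size()) < nWanted) {
        const double t = decayTime(rng);
        if (uniform(rng) < acceptance(t)) accepted.push_back(t);
    }
    return accepted;
}
```

Because short decay times are preferentially rejected, the average of the accepted times is larger than the true lifetime. This is the kind of systematic bias that a toy study has to model and quantify.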
StatHEP for the LHCb experiment
The purpose of the LHCb
experiment is to measure the phenomena of CP
violation in the systems of B and Bs mesons. The CP measurements are
the result of a complicated procedure. The B mesons are produced in
the hadronic environment of proton-proton collisions.

Events with b quarks amount to only 1% of all interactions.
Moreover, the B decay
events that are interesting for CP measurements are relatively rare: they occur
with probabilities ranging from 10^-4 down
to 10^-9. Extracting a tiny signal from the huge
background requires
sophisticated algorithms already at the level of on-line data
taking (bandwidth reduction from 40 MHz down to 200 Hz). The data are
then reconstructed off-line and CPV phenomena are
studied for more than 50 different decay channels. In the final step a
multiparameter fit to the data is performed.
The whole chain of the data taking and data analysis is studied by means of
the full simulation program.
It describes the details of the passage of particles through the detector material and the response
of the readout electronics. Its complexity makes it practically impossible to produce
samples of sufficiently high statistics (including background)
to estimate the uncertainty of the measurement.
The Toy Monte Carlo technique is commonly used in such cases.
The simplified model of the measurement procedure is based on the
probability density functions, acceptance functions etc. derived from
the full simulation program. This way a large number of pseudo-experiments can be
modelled.
The result of a typical study is presented in the figure below.
It illustrates the estimated uncertainty of one of the CP violation parameters,
the angle gamma of the unitarity triangle. Each 'data' point corresponds to 1000
jobs executed on the GRID.

Contact persons:
Mariusz Witek, email: Mariusz.Witek@ifj.edu.pl
Michal Krasowski, email: Michal.Krasowski@ifj.edu.pl