StatHEP - Statistical Data Analysis for HEP

The first version of the application with full grid functionality is ready. A number of short test productions have been successfully executed on the BalticGrid infrastructure. The first preliminary physics results on the expected sensitivity of the CP violation measurement in the LHCb experiment have recently been obtained.

The application.

The goal of the application is to provide a generic environment for running toy Monte Carlo jobs. It is primarily meant for studying the sensitivity and systematic biases of CP violation measurements, but it can equally well be used for solving other statistical problems.

The application consists of three main parts:

  1. Framework part - RooFit and ROOT
  2. User part - user ROOT macro (C++) and the script to generate input data files.
  3. GRID part - submission of jobs and retrieval of results.

The framework part is responsible for providing the ROOT environment on a given worker node (WN). When a job arrives at the WN, it checks whether a ROOT installation exists; if none is detected, the ROOT source code is fetched and compiled, and the installation tree is built in the common area of the given VO. The name of the top directory encodes the processor architecture and the compiler version, so that sites with inhomogeneous worker nodes can be supported.
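
The directory naming idea can be illustrated with a minimal sketch. The function name rootInstallDirName and the "root-<arch>-<compiler><version>" pattern are assumptions made for illustration only; the actual naming scheme used by the application is not specified here.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build the name of the top directory of a ROOT
// installation so that it encodes the processor architecture and the
// compiler version. The "root-<arch>-<compiler><version>" pattern is an
// assumption, not the scheme actually used by the application.
std::string rootInstallDirName(const std::string& arch,
                               const std::string& compiler,
                               const std::string& compilerVersion) {
    assert(!arch.empty() && !compiler.empty() && !compilerVersion.empty());
    return "root-" + arch + "-" + compiler + compilerVersion;
}
```

With such a scheme, worker nodes of different architecture or with different compiler versions would each find (or build) their own installation tree in the common area, without overwriting each other's.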
Next comes the execution of the user part, whose main component is a ROOT macro written in C++. The macro can be executed in interpreter mode, or it can be compiled and dynamically linked to the ROOT executable. The latter mode is recommended as it is much faster.
The GRID functionality is provided by scripts. The main steps are to generate the input data files (the parameters with which the user macro is run), to prepare the GRID job and to submit it. Each job produces results in the form of text files or ROOT histograms/ntuples. In the final step all results are retrieved and analysed to obtain the final summary result.
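
The input-generation step can be sketched as follows. This is only an illustration under stated assumptions: the function name makeJobInputs, the "seed <n>" line and the free-form parameter text are hypothetical, and the real application uses its own scripts and file format.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: produce the text of one input data file per job.
// Every job receives the same physics parameters but its own random seed,
// so that the jobs are statistically independent pseudo-experiments.
std::vector<std::string> makeJobInputs(int nJobs, unsigned firstSeed,
                                       const std::string& commonParameters) {
    assert(nJobs > 0);
    std::vector<std::string> inputs;
    inputs.reserve(nJobs);
    for (int job = 0; job < nJobs; ++job) {
        const unsigned seed = firstSeed + static_cast<unsigned>(job);
        inputs.push_back("seed " + std::to_string(seed) + "\n" +
                         commonParameters + "\n");
    }
    return inputs;
}
```

For example, makeJobInputs(1000, 1, "gamma_deg 60") would give 1000 input texts that differ only in the seed line; the parameter name gamma_deg is again just a placeholder.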
A graphical user interface is planned for the future; it will probably be based on the Migrating Desktop.

It is worth noting that the application can be used outside HEP. The only elements that need to be tailored or replaced by the user are the ROOT macro, the script that generates the input data and the script that collects and analyses the results.

Toy MC technique in a nutshell.

Physics results are often complicated functions of observables. A common feature of analyses is that the measurement is the result of a complicated procedure, often including a simultaneous fit to several distributions, each depending on a number of unknown parameters. Analytical evaluation of the errors of such measurements is difficult or impossible. The toy Monte Carlo technique is commonly used in such cases. A simplified model of the measurement is built and a large number of identical jobs are launched, each with a different random number seed. The width of the distribution of the measured values (a typical distribution is shown below) is taken as the estimate of the measurement uncertainty. In more complicated cases the measurement might depend on a number of parameters.
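
The procedure can be illustrated with a minimal, self-contained C++ sketch that uses only the standard library rather than RooFit. The "measurement" here is deliberately trivial (the mean of a Gaussian sample), and the names runToyExperiment, runToyStudy and ToySummary are assumptions introduced for this example; a real study would replace the body of runToyExperiment with the generation and fit of the actual model.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// One pseudo-experiment: generate nEvents values from the simplified model
// (here simply a Gaussian, an assumption made for illustration) and
// "measure" the parameter of interest, in this case the sample mean.
double runToyExperiment(std::uint32_t seed, int nEvents,
                        double trueMean, double sigma) {
    assert(nEvents > 0);
    std::mt19937 rng(seed);  // each pseudo-experiment has its own seed
    std::normal_distribution<double> model(trueMean, sigma);
    double sum = 0.0;
    for (int i = 0; i < nEvents; ++i) sum += model(rng);
    return sum / nEvents;
}

struct ToySummary {
    double mean;   // average of the measured values
    double width;  // standard deviation of the measured values
};

// Repeat the pseudo-experiment nToys times with different seeds and
// summarise the distribution of the measured values.
ToySummary runToyStudy(int nToys, int nEvents, double trueMean, double sigma,
                       std::uint32_t firstSeed = 1) {
    assert(nToys > 1);
    std::vector<double> measured;
    measured.reserve(nToys);
    for (int t = 0; t < nToys; ++t) {
        const std::uint32_t seed = firstSeed + static_cast<std::uint32_t>(t);
        measured.push_back(runToyExperiment(seed, nEvents, trueMean, sigma));
    }
    double mean = 0.0;
    for (double m : measured) mean += m;
    mean /= nToys;
    double variance = 0.0;
    for (double m : measured) variance += (m - mean) * (m - mean);
    variance /= (nToys - 1);
    ToySummary summary;
    summary.mean = mean;
    summary.width = std::sqrt(variance);
    return summary;
}
```

In this simple case the answer is known analytically, which makes the sketch easy to check: runToyStudy(2000, 100, 0.0, 1.0) should return a width close to 1/sqrt(100) = 0.1. On the GRID, each call to runToyExperiment would correspond to one job, and the summary step would be performed after the results have been retrieved.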

RooFit is a dedicated package that provides tools for building toy MC models. It was developed for the BaBar experiment (see the RooFit at BaBar pages and the Introduction slides) within the ROOT framework. RooFit contains many advanced utilities, such as conditional PDFs.

The toy Monte Carlo approach is very often used in HEP experiments. HEP detectors are complex apparatuses and the phenomena studied are very subtle. The required precise understanding of the whole chain of data acquisition and analysis is provided by the full simulation program. The full simulation is a very CPU-intensive task, so only MC samples of limited statistics can be produced, and these are insufficient for studying the precision of measurements.
The toy Monte Carlo technique is commonly used in such cases. The main idea is to prepare a simplified model of the measurement employing probability density functions, acceptance functions etc. derived from the full simulation program, to submit a large number of jobs that execute the procedure with different initial parameters on the GRID, and to analyse the distributions of the outcomes. A typical measurement of combined CP symmetry violation belongs to this category of problems. The final result depends on many other parameters, some of which determine the PDFs (probability density functions) or the variables that the result depends on.

StatHEP for the LHCb experiment

The purpose of the LHCb experiment is to measure CP violation phenomena in the systems of B and Bs mesons. The CP measurements are the result of a complicated procedure. The B mesons are produced in the hadronic environment of proton-proton collisions.

Events containing b quarks amount to only 1% of all interactions. Moreover, the B decays that are interesting for CP measurements are relatively rare: they occur with probabilities ranging from 10^-4 down to 10^-9. Extracting a tiny signal from the huge background requires sophisticated algorithms to be used already at the level of on-line data taking (bandwidth reduction from 40 MHz down to 200 Hz). The data are then reconstructed off-line and CPV phenomena are studied for more than 50 different decay channels. In the final step a multiparameter fit to the data is performed.

The whole chain of data taking and data analysis is studied by means of the full simulation program. It describes the details of the passage of particles through the detector material and the response of the readout electronics. Its complexity makes it practically impossible to produce samples of sufficiently high statistics (including background) to estimate the uncertainty of the measurement. The toy Monte Carlo technique is commonly used in such cases. The simplified model of the measurement procedure is based on the probability density functions, acceptance functions etc. derived from the full simulation program. In this way a large number of pseudo-experiments can be modelled. The result of a typical study is presented in the figure below. It illustrates the estimated uncertainty of one of the CP violation parameters, the angle gamma of the unitarity triangle. Each 'data' point corresponds to 1000 jobs executed on the GRID.



Contact persons:
Mariusz Witek, email: Mariusz.Witek@ifj.edu.pl
Michal Krasowski, email: Michal.Krasowski@ifj.edu.pl