# gbutils overview

## Table of Contents

## Brief description of programs

The programs in `gbutils`

can be divided in four broad classes:

- Data Manipulation
- Data Transformation
- Descriptive Statistics
- Statistical Tests and Models

The basic operation is essentially the same for all programs: you feed the standard input of the program with data in ASCII format separated by spaces, tabs or newline character. In general, each input line is considered a record and the blank separated entries in each line are considered different fields. The exact way in which different records and fields are treated depends on the program and can vary accordingly to the options specified in the command line (see below).

After the program has read the data from standard input, it performs the required manipulations/analyses and prints the result to standard output, in the form of an ASCII file of newline separated records. Inside each output record, the fields are separated by spaces. Obviously, the meaning of the records and fields depend on the program.

### Data Manipulation

These programs do not perform any analysis by themselves. Rather, they
are provided as an help to prepare data for subsequent analysis. In
particular, gbget is the only program that reads data from file and
not from standard input. It can be used to extract data, according to
a given pattern, from one or more files and send them, through a pipe
`|`

, to other utilities. This program possesses a rather complex set
of options. See README.gbget for a tutorial on its use.

- gbget
- extract data from a tabular input according to a specified pattern. It is possible to access more files at the same time, merge their contents and transpose or flatten the resulting table.
- gbfun
- compute generic functions on data in a column-wise manner. The function can be applied to all the columns or defined in a recursive way.
- gbgrid
- generate a grid (i.e. a matrix) of values according to a user specified function
- gbboot
- generate bootstrapped sequences from data provided sample
- gbrand
- generated i.i.d. pseudo random variates
- gbenv
- provide information about the numeric environment and the internal settings of the package

### Data transformation

These programs perform basic transformation on input data, which are often considered preliminary to further statistical analysis.

- gbmave
- print moving statistics (average, variance, etc.) of input data

- gbinterp
- compute the interpolation on a regular mesh of user provided points. It can also print first and second derivative of the interpolation.
- gbfilternear
- filter near points in Ecuclidean metrics. Point whose distance is below a given threshold are removed.

### Descriptive statistics

These utilities are useful in the representation and description of data. They encompass simple statistics and more "advanced" non parametric methods.

- gbdist
- cumulative distribution of input data
- gbstat
- simple descriptive statistics of input data
- gbbin
- compute binned statistics
- gbquant
- quantiles of the empirical distribution of input data
- gbhisto
- histogram for univariate data. Choose between absolute frequencies, relative frequencies and empirical density
- gbker
- kernel density estimate for univariate data. The type of kernel, the bandwidth and the computation method can be specified at the command line
- gbnear
- density estimate via nearest neighbors method
- gbker2d
- kernel density estimate for bivariate data
- gbhisto2d
- histograms for bivariate data
- gbgcorr
- Compute the correlation dimension of a time series with a Gaussian kernel.
- gbacorr
- It computes the autocorrelogram or the cross-autocorrelogram of a series of observations. It reads the data column-wise.
- gbxcorr
- Compute the cross-covariance and cross-correlation coefficients with and without the removal of the mean of two samples. It reads the data column-wise.

### Statistical tests and models

The utilities provide statistical tests to compare different samples and non-parametric method to investigate relationship between paired (or in general compounded) observations.

- gbtest
- various one and two samples statistical tests. When available, p-score significance is also provided.
- gbmodes
- find the critical bandwidth for a kernel density estimate to generate a given number of modes and compute the associated p-value using smoothed bootstrap technique
- gbbin
- the program takes couples of values X Y (separated by spaces), bins them with respect to the first variable and prints statistics of the second variables
- gbkreg
- compute the kernel non-linear regression function
- gbkreg2d
- compute the kernel non-linear regression function on three dimensional data
- gblreg
- compute linear OLS regression
- gbglreg
- compute generalized linear OLS regression
- gbnlreg
- compute non linear regression using OLS, MAD or asymmetric MAD estimators
- gbnlqreg
- compute non linear quantile regression
- gbnlmult
- contemporaneous least square estimation of a system of non linear equations.
- gbnlprobit
- estimate a non linear probit model on binary data
- gbhill
- estimate different families of probability distribution on the extremal data using maximum likelihood.

For more information on a specific command, use the -h command line option.

Please, notice that all the programs work by loading the whole set of data in memory before computing the relevant statistics. In this respect, they are probably not suitable to be used on very large datasets.

## Understanding Input/Output

All the commands of this package read input in ASCII format. The data
should be separated by white characters (spaces or tabs) or
newlines. Lines beginning with a fence symbol `#`

are ignored. They
are simply skipped by the input routine.

If support for the zlib has been included at compile time (see above) the input ASCII file can be gz-compressed.

A file can contain several blocks of data. Blocks are separated by two consecutive blank lines. In general, all operations are performed on the first block found in the datafile. The program 'gbget' can be used to extract one particular block (or set of blocks) from one file.

### Sequential, tabular and compounded input

The utilities in this package use three different ways of reading data from input:

- sequential
- In 'sequential' format a single dataset is internally build from the data input file. All the entries found on one column of input are read sequentially and put in the same dataset. Notice that the different lines must contain the same number of entries or NAN values are generated.
- tabular
- In 'tabular' format, each column of the input is treated as a different dataset; the program will internally create a list of datasets, one for each column of input. The different entries on one line are then put inside different sets. Notice that, in this case, the number of fields in the first non-comment non-empty line decides the number of datasets. All subsequent input lines should contain the same number of fields (but see below).
- compounded
- In 'compounded' format the program reads a fixed number of fields from each line. Each line is internally stored as an n-tuple (a couple or a triplet) and treated accordingly. Notice that if some line contains more fields than needed, the extra fields are ignored.

An example can clarify the difference between the 'sequential', 'tabular' and 'compounded' format. Suppose to have the following input datafile

1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0

using the 'sequential' input the unique dataset
`{1.0,2.0,3.0,4.0,...,9.0}`

is internally generated by the
program. Notice the ordering: all the entries of one line are inserted
in the internal dataset before the next line is red.

In 'tabular' format, the program builds instead three different
datasets: `{1.0,4.0,7.0}`

, `{2.0,5.0,8.0}`

and `{3.0,6.0,9.0}`

and use
each set separately for its subsequent duties (topically reproducing
the same statistical analysis for each set).

In 'compounded' format, assuming that the program accepts couples, the
following array of ordered couples is generated
`{(1.0,2.0),(4.0,5.0),(7.0,8.0)}`

. Notice that this is a single
dataset, made of couples of associated values.

When available, the "sequential" format is the default while the "tabular" format is activated with the option '-t'. See the Programs summary table for the list of input format accepted by the different utilities.

### Missing values and NaN management

When the conversion of an input entry to an internal floating point number cannot be performed, or when, on an input line, there are not enough values for the required "tabular" or "compounded" format, a NAN (not-a-number) value is generated. This approach is introduced to make possible the manipulation of files with an uneven number of entries in different columns or with "non numerical" values.

The following utilities automatically remove the `NaN`

values from their
input: `gbstat`

, `gbdist`

and `gbquant`

.

The other utilities do handle `NaN`

values as expected: if `NaN`

values
are present they typically return `NaN`

output. In this case, the option
`D`

of the `gbget`

utility is provided to remove all the lines
containing `NaN`

entries. This program can be used in a pipe like

...| gbget '()D' | ....

to treat the data before passing them to other `NaN`

-sensitive
utilities.

### Radix and thousands separator

In addition to the radix symbol which separates the fractional and the integer part of the number, sometimes data are reported with a thousand separator symbol. For instance "one million" could be written "1,000,000.00". The character used to separate thousands and the fractional part are defined inside the C locale. Programs in the gbutils package can automatically recognize the locale settings and process these entries accordingly. Please use "gbenv" to see the definitions in use. Changing the locale typically amount simply to the redefinition of the LANG environment variable

# export LANG="en_US"

A list of the available locale can be obtained with the locale program

# locale -a

and the actual setting verified with

# locale

For more details refer to the locale documentation.

### Output format and precision

In general, the output from the different programs is made of newline
separated records of space separated fields of standard ASCII
characters, which represent floating point numbers. The default format
is scientific notation with a precision of six digits. The format and
the precision can be changed using the environment variable
`GB_OUT_FLOAT_FORMAT`

. This variable can be set to any `printf`

(the
standard library C function) meaningful string. For instance with

# export GB_OUT_FLOAT_FORMAT="%.8e"

the precision is extended to eight digit. While with

# export GB_OUT_FLOAT_FORMAT="%.fe"

the scientific notation is replaced with a fixed-point
notation. Please, refer to the `printf`

documentation for further
details.

There is also a second variable, `GB_OUT_EMPTY_FORMAT`

, which can be
used to tune the comment headings that many programs generate with the
verbose option `-v`

. Notice that it is automatically set to a value
which is consistent with the float format chosen, so in general it is
a good idea not to change it explicitly.

## Numerical Error handling

The default behaviour of Gnu Scientific Library functions is to abort
the execution of the program if a numeric error is produced. Some of
these errors, especially underflow errors, are tolerable inside a
computation. The 'gbutils' package provides a way of switching off the
GSL error handling. It is sufficient to set the environment variable
`GB_ERROR_HANDLER_OFF`

using

# export GB_ERROR_HANDLER_OFF=

and all the programs will ignore numerical errors. This feature must be used carefully, after checking that the loss of precision implied by the presence of these errors can be considered tolerable for the actual computation one wants to perform. The default behaviour can be recovered using

# unset GB_ERROR_HANDLER_OFF

## Binary format

*THIS IS AN EXPERIMENTAL FEATURE*

Like ASCII files, the binary files are structured as sequences of separate blocks. Each block is made of

- one
`size_t`

with number of columns C - C
`size_t`

with the length of the rows, R_{1}… R_{C} - the data stored sequentially column by column, for a total number
equal to R
_{1}+R_{2}+…+R_{C}

This structure allows the storage of non matrices structures in binary format. If lengths are different, the missing values are replaced with NANs. This mimic the behaviour of ASCII data handling.

Notice that blocks are simply written one after the other. No particular separators are inserted between them.

Implementation: the option `-b`

redefines the function used to read
and/or write data.

This feature has been implemented for `gbget`

, `gbmstat`

and `gbfun`

.

## Graphic output

As previously mentioned, the output of many programs in the gbutils package, like gbhisto or gbker, is intended to be plotted and not directly read from the terminal. It is generally composed of records and fields of standard ASCII characters. This type of output can be displayed using the various plotting utilities commonly available in Unix systems. We shortly review below three possibilities.

### GNU plotutils package

The plotutils package can be found here. It contains the program
`graph`

which generate a plot starting from input data. For example to
obtain a plot of the kernel density of the data in file `datafile.dat`

one can use

gbker < datafile.dat | graph -T x

where `-T x`

choose xwindow as output device.

### Gnuplot interactive session

An alternative is to use the powerful plotting environment provided by gnuplot. The program can be found here.

From inside a gnuplot session, the previous kernel density can be obtained with

plot "< gbker < datafile.dat "

see Gnuplot documentations for details.

### Gnuplot's plot from command line

As the example above shows, in order to plot the output of a command
inside gnuplot you need to put it inside an expression delimited by "<
and ". Moreover, all double quotes " have to be escaped with a
backslash \. These requirements can lead to cumbersome expressions
when complicated commands are necessary. Moreover, starting an
interactive gnuplot session and writing the expression whose output
should be plotted doesn't seem so attractive when one needs fast,
simple plotting, for exploratory purposes. These are the cases foe
qhich the command `gbplot`

is provided. This is a shell script that
accept the data to be plotted as input and the directive on how to
plot it on the command line.

The basic usage is as follows

gbplot [options] [plot|splot] <plotting options> < datafile

or

command pipe | gbplot [options] [plot|splot] <plotting options>

The command `plot`

or `splot`

are required. One can provide further
plotting options by inserting them after these command. For example
one can plot the kernel density estimate using

gbker < datafile.dat | gbplot plot

In this way the density is plotted using simple points. To use the fancier gnuplot's 'histeps' style use instead

gbker < datafile.dat | gbplot plot with histeps

The syntax of the plotting options is exactly the same that would be
used inside gnuplot, after a the `plot`

or `splot`

command. For
instance to specify a range for the x values use

gbker < datafile.dat | gbplot plot '[-1:1]' with histeps

It is also possible to obtain multiple plots of the data using the gnuplot special file name '""', as in

gbker < datafile.dat | gbplot plot 'w p , "" w l'

This command draws the kernel estimate two times: the first with
points, the second with a line (as specified by the `w l`

expression).

`gbplot`

also possesses several options. They must be specified before
the `plot`

or `splot`

command. To insert a title in the plot use the
option `-t`

gbker < datafile.dat | gbplot -t Title plot with histeps

Terminal type and output file can be specified with the `-T`

and `-o`

options respectively. The command

gbker < datafile.dat | gbplot -T pdf -o output.pdf plot with histeps

produce a pdf version fo the plot and save it in 'output.pdf'.

Finally, if an interactive manipulation of plot parameters or data is
required, you can use the option `-i`

. This option opens an
interactive gnuplot session, allowing for direct manipulation of plot
settings and parameters

gbker < datafile.dat | gbplot -i plot with histeps

Once the session is closed, the output is saved in a file using a
specific terminal if options `-o`

and `-T`

have been specified.

## Programs summary table

Name | Input Type | External lib | NAN | |
---|---|---|---|---|

gbget | c+ | (matheval) | * | |

gbfun | c+ | matheval | ||

gbgrid | no | matheval | ||

gbrand | no | gsl | ||

gbboot | s,t,c+ | (gsl) | ||

gbenv | no | |||

gbmave | s,t | * | ||

gbinterp | c,2 | gsl | ||

gbfilternear | c+ | |||

gbdist | s,t | * | ||

gbstat | s,t | * | ||

gbquant | s,t,c+ | * | ||

gbhisto | s,t | |||

gbker | s | gsl | ||

gbnear | s | |||

gbhisto2d | c2 | |||

gbgcorr | s | |||

gbacorr | c1,c2 | |||

gbxcorr | c2 | |||

gbker2d | c2 | |||

gbbin | c+ | |||

gbtest | c+ | (gsl) | * | |

gbmodes | s | gsl | ||

gbbin | t | |||

gbkreg | c2 | gsl | ||

gbkreg2d | c3 | |||

gblreg | c2 | gsl | ||

gbglreg | c+ | gsl | ||

gbnlreg | c+ | gsl,matheval | ||

gbnlqreg | c+ | gsl,matheval | ||

gbhill | s | gsl | ||

gbnlmult | c+ | gsl,matheval | ||

gbnlprobit | c+ | gsl,matheval | ||

gbnlpanel | c+ | gsl,matheval | * | |

gbnlpolyit | c+ | gsl,matheval |

**Input Type**: 's' sequential; 't' tabular; 'c' compounded c2 read
couples, c3 triplets, c+ a variable number of columns; 'no' no input
required

**External libs**: gsl: Gnu Scientific Library matheval: GNU matheval
library () means optional dependence (special features are available
only if the library is found)

**NAN**: program automatically ignores NAN values in computations