22. Februar 2010

Statistische Funkionen: Alles auf einmal mit DBMS_STAT_FUNCS

English title: Statistic functions: DBMS_STAT_FUNCS

Die Tage habe ich eine nette PL/SQL-Funktion gefunden, die "auf einen Rutsch" alle möglichen Aggregate für eine Tabellenspalte berechnet - sie ist im Package DBMS_STAT_FUNCS (Dokumentation). Am besten guckt man sich das am Beispiel an.
These days I found a nice litte function which computes various aggregates for a (numeric) table column at once. It's contained in the package DBMS_STAT_FUNCS (Documentation). Let's have a look ...
SQL> desc DBMS_STAT_FUNCS

PROCEDURE SUMMARY
 Argument Name                  Typ                     In/Out Defaultwert?
 ------------------------------ ----------------------- ------ --------
 P_OWNERNAME                    VARCHAR2                IN
 P_TABLENAME                    VARCHAR2                IN
 P_COLUMNNAME                   VARCHAR2                IN
 P_SIGMA_VALUE                  NUMBER                  IN     DEFAULT
 S                              RECORD                  OUT
   COUNT                        NUMBER                  OUT
   MIN                          NUMBER                  OUT
   MAX                          NUMBER                  OUT
   RANGE                        NUMBER                  OUT
   MEAN                         NUMBER                  OUT
   CMODE                        NUM_TABLE               OUT
   VARIANCE                     NUMBER                  OUT
   STDDEV                       NUMBER                  OUT
   QUANTILE_5                   NUMBER                  OUT
   QUANTILE_25                  NUMBER                  OUT
   MEDIAN                       NUMBER                  OUT
   QUANTILE_75                  NUMBER                  OUT
   QUANTILE_95                  NUMBER                  OUT
   PLUS_X_SIGMA                 NUMBER                  OUT
   MINUS_X_SIGMA                NUMBER                  OUT
   EXTREME_VALUES               NUM_TABLE               OUT
   TOP_5_VALUES                 N_ARR                   OUT
   BOTTOM_5_VALUES              N_ARR                   OUT
Da wird ein RECORD-Datentyp zurückgegeben; zum Ausprobieren muss man also ein wenig PL/SQL drumherumbauen ...
The result is an OUT parameter of a RECORD type. So test it we need to wrap a bit of PL/SQL around ...
set serveroutput on

declare
  v_sum  dbms_stat_funcs.summaryType;
  v_extv dbms_stat_funcs.num_table;
  v_topn dbms_stat_funcs.n_arr;
  v_botn dbms_stat_funcs.n_arr;
begin
  dbms_stat_funcs.summary(
    P_OWNERNAME   => user,
    P_TABLENAME   => 'CUSTOMERS',
    P_COLUMNNAME  => 'CUST_CREDIT_LIMIT',
    S             => v_sum
  );


  dbms_output.put_line('COUNT:             '||v_sum.COUNT);
  dbms_output.put_line('MIN:               '||v_sum.MIN);
  dbms_output.put_line('MAX:               '||v_sum.MAX);
  dbms_output.put_line('RANGE:             '||v_sum.RANGE);
  dbms_output.put_line('MEAN:              '||v_sum.MEAN);
  for i in v_sum.cmode.first..v_sum.cmode.last loop
    dbms_output.put_line('CMODE['||i||']:          '||v_sum.CMODE(i));
  end loop;
  dbms_output.put_line('VARIANCE:          '||v_sum.VARIANCE);
  dbms_output.put_line('STDDEV:            '||v_sum.STDDEV);
  dbms_output.put_line('QUANTILE_5:        '||v_sum.QUANTILE_5);
  dbms_output.put_line('QUANTILE_25:       '||v_sum.QUANTILE_25);
  dbms_output.put_line('MEDIAN:            '||v_sum.MEDIAN);
  dbms_output.put_line('QUANTILE_75:       '||v_sum.QUANTILE_75);
  dbms_output.put_line('QUANTILE_95:       '||v_sum.QUANTILE_95);
  dbms_output.put_line('PLUS_X_SIGMA:      '||v_sum.PLUS_X_SIGMA);
  dbms_output.put_line('MINUS_X_SIGMA:     '||v_sum.MINUS_X_SIGMA);
  for i in 1..least(v_sum.extreme_values.count, 9) loop
    dbms_output.put_line('EXTREME_VALUES['||i||']: '||v_sum.EXTREME_VALUES(i));
  end loop;
  for i in v_sum.top_5_values.first..v_sum.top_5_values.last loop
    dbms_output.put_line('TOP_VALUES['||i||']:     '||v_sum.TOP_5_VALUES(i));
  end loop;
  for i in v_sum.bottom_5_values.first..v_sum.bottom_5_values.last loop
    dbms_output.put_line('BOTTOM_VALUES['||i||']:  '||v_sum.BOTTOM_5_VALUES(i));
  end loop;
end;
/

COUNT:             55500
MIN:               1500
MAX:               15000
RANGE:             13500
MEAN:              6153,261261261261261261261261261261261261
CMODE[1]:          1500
VARIANCE:          12936897,23619725962110509579833351432837
STDDEV:            3596,789851547802009993634287000834918826
QUANTILE_5:        1500
QUANTILE_25:       3000
MEDIAN:            7000
QUANTILE_75:       9000
QUANTILE_95:       11000
PLUS_X_SIGMA:      16943,6308159046672912421641222637660178
MINUS_X_SIGMA:     -4637,108293382144768719641599741243495239
TOP_VALUES[1]:     15000
TOP_VALUES[2]:     15000
TOP_VALUES[3]:     15000
TOP_VALUES[4]:     15000
TOP_VALUES[5]:     15000
BOTTOM_VALUES[1]:  1500
BOTTOM_VALUES[2]:  1500
BOTTOM_VALUES[3]:  1500
BOTTOM_VALUES[4]:  1500
BOTTOM_VALUES[5]:  1500
Das man das alles mit SQL auch selbst bauen kann, ist ja klar - aber welches SQL setzt die Datenbank hier im Hintergrund ab ...? Naja - kein Problem: Finden wir's raus ...
It's obvious that you could also write your own SQL in order to get those results. But I'm now keen on the actual SQL the database uses here ... Now let's trace it ...
SQL> alter session set sql_trace = true;

Session altered.

SQL> @summary.sql

SQL> exit
Mit dem ALTER SESSION-Kommando wird ein SQL Trace aktiviert - die Datenbank schreibt also alles SQL, was im Hintergrund abgesetzt wird, in eine Tracedatei. Jene Tracedatei liegt dann auf dem Datenbankserver in der sog. user-dump-destination. Um da ranzukommen, muss man nun also Zugriff auf den Datenbankserver haben ... (oder dieses Posting hier lesen). Hat man die Datei, so macht man sie zuerst mit dem tkprof-Werkzeug lesbar ...
The ALTER SESSION command starts a SQL trace. The database will now write each SQL statement which is being executed in the background into a tracefile (actually not only the SQL statement but also additional information). This tracefile is located in the user-dump-destination on the database server's filesystem. To get it you need access to the database server (or read this blog posting). Convert the file into a readable format with the tkprof utility.
$ tkrpof orcl_ora_2672.trc output.txt
Den zweiten Dateinamen könnt Ihr frei wählen. Wenn Ihr dann die Datei output.txt mit einem Texteditor öffnet und euch darin ein wenig umseht, solltet Ihr das SQL recht schnell finden ...
You can choose a random filename as the second parameter. Now open the file output.txt with a text editor and look around - you should find the background SQL quickly ...
with aux_summary as (
  select
    count(CUST_CREDIT_LIMIT) as count_V ,
    min(CUST_CREDIT_LIMIT) as min_V,    
    max(CUST_CREDIT_LIMIT) as max_V,
    max(CUST_CREDIT_LIMIT)- min(CUST_CREDIT_LIMIT) as range_V,    
    avg(CUST_CREDIT_LIMIT) as mean_V,
    variance(CUST_CREDIT_LIMIT) as  variance_V, 
    stddev(CUST_CREDIT_LIMIT) as stddev_V,    
    percentile_cont(0.05) within group (order by CUST_CREDIT_LIMIT) as quantile_5_V,    
    percentile_cont(0.25) within group (order by CUST_CREDIT_LIMIT) as quantile_25_V,
    median(CUST_CREDIT_LIMIT) as median_V,
    percentile_cont(0.75) within group (order by CUST_CREDIT_LIMIT) as quantile_75_V,
    percentile_cont(0.95) within group (order by CUST_CREDIT_LIMIT) as quantile_95_V,
    (avg(CUST_CREDIT_LIMIT) + :1 * stddev(CUST_CREDIT_LIMIT)) as plus_x_sigma_V,
    (avg(CUST_CREDIT_LIMIT) - :1 * stddev(CUST_CREDIT_LIMIT)) as minus_x_sigma_V
  from "SH".CUSTOMERS
) 
:
Das SQL ist etwas länger .. zum Ermitteln der Extreme Values und der Top- und Bottom-5-Values muss er etwas komplexeres SQL bauen. Aber man sieht sehr schön, was man mit SQL-Standardfunktionen so alles anfangen kann ...
This SQL query is a bit more complex - for calculating the extreme values the top- and bottom-5-values the database needs to do some complex query. But it's very nice to see what's possible with standard SQL.

Keine Kommentare:

Beliebte Postings