September 29, 2008

String Tokenizer for PL/SQL

Last week I had to split a string into tokens along a delimiter character. Not for the first time, of course, but this time I decided to build a generic solution. In Java this is trivial: the java.util.StringTokenizer class offers a very simple interface for tokenizing a string and accessing the tokens. I wanted a similar API for PL/SQL, so today I am posting some generic PL/SQL code that behaves much like that Java class: the PL/SQL string tokenizer.
Although there are more PL/SQL string tokenizer snippets on the Internet than grains of sand on the beach, I have deliberately written yet another implementation, this time as a database object type. With this approach you can use several tokenizer instances in the same database session at the same time; with a plain PL/SQL package you would have to code the handling of the individual "instances" yourself, which I considered too much effort. The Java class java.util.StringTokenizer served as the model for the API of the object type.
The following script creates two object types. STRING_TOKENIZER works on VARCHAR2 values; since it is an object type, the SQL limit of 4,000 bytes applies. For larger strings there is the CLOB_TOKENIZER type, which uses the DBMS_LOB API for parsing and can therefore tokenize character strings (CLOB) of arbitrary length (individual tokens are still limited to 4,000 bytes, because TOKEN_T stores the token text as VARCHAR2(4000)). Because of the DBMS_LOB calls, CLOB_TOKENIZER is slower than STRING_TOKENIZER, so for short strings (up to 4,000 bytes) you should always use STRING_TOKENIZER.
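-- drop earlier versions of the types; the order matters because of the
-- dependencies (the tokenizer types use TOKEN_LIST, which uses TOKEN_T)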
drop type clob_tokenizer
/

drop type string_tokenizer
/

drop type token_list
/

drop type token_t
/

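-- TOKEN_T describes a single token: its text, its start position
-- within the source string and its length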
create type token_t as object(
  token_text varchar2(4000),
  start_pos  number,
  length     number
)
/

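-- nested table of TOKEN_T: the return type of the pipelined ALL_TOKENS functions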
create type token_list as table of token_t
/

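-- CLOB_TOKENIZER: tokenizes character strings of arbitrary length using DBMS_LOB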
create type clob_tokenizer as object(
  value_string       clob,
  delimiter          varchar2(10),
  parser_current_pos number,
  last_token         varchar2(4000),
  constructor function clob_tokenizer (p_string in clob, p_delim in varchar2) return self as result,
  member function has_more_tokens return number,
  member function next_token(self in out nocopy clob_tokenizer) return varchar2,
  static function all_tokens (p_string in clob, p_delim in varchar2) return token_list pipelined parallel_enable,
  static function all_tokens_cur (p_cursor in sys_refcursor, p_delim in varchar2) return token_list pipelined parallel_enable (partition p_cursor by any)
);
/
sho err

create or replace type body clob_tokenizer is
  constructor function clob_tokenizer (p_string in clob, p_delim in varchar2) return self as result as
  begin
    self.value_string := p_string;
    self.delimiter := p_delim;
    self.parser_current_pos := 1;
    self.last_token := null;
    return ;
  end;
    
  member function has_more_tokens return number as
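    -- return 1 while unparsed text remains, 0 otherwise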
  begin
    if self.parser_current_pos <= dbms_lob.getlength(self.value_string) then 
      return 1;
    else 
      return 0;
    end if;
  end;

  member function next_token(self in out nocopy clob_tokenizer) return varchar2 is
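    -- return the next token and advance PARSER_CURRENT_POS past the delimiter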
    l_next_delim_pos   number;
    l_token            varchar2(4000);
  begin
    if self.has_more_tokens() = 1 then 
      l_next_delim_pos := dbms_lob.instr(self.value_string, self.delimiter, self.parser_current_pos);
      if l_next_delim_pos = 0 then
        l_token := dbms_lob.substr(
          lob_loc => self.value_string, 
          amount  => (dbms_lob.getlength(self.value_string) - self.parser_current_pos) + 1,  
          offset  => self.parser_current_pos
        );
        self.parser_current_pos := dbms_lob.getlength(self.value_string) + 1; 
      else 
        l_token := dbms_lob.substr(
          lob_loc => self.value_string, 
          amount  => l_next_delim_pos  - self.parser_current_pos, 
          offset  => self.parser_current_pos
        );
        self.parser_current_pos := l_next_delim_pos + length(self.delimiter);
      end if;
    else 
      l_token := null;
    end if;
    self.last_token := l_token;
    return l_token;
  end;

  static function all_tokens (p_string in clob, p_delim in varchar2) return token_list pipelined parallel_enable is
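    -- pipelined table function: returns all tokens of P_STRING as TOKEN_T rows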
    l_st clob_tokenizer := clob_tokenizer(p_string, p_delim);
    l_startpos number;
    l_token    varchar2(4000);
  begin
    while l_st.has_more_tokens = 1 loop
      l_startpos := l_st.parser_current_pos;
      l_token := l_st.next_token();
      pipe row (token_t(l_token, l_startpos, nvl(length(l_token),0)));
    end loop;
    return;
  end;

  static function all_tokens_cur (p_cursor in sys_refcursor, p_delim in varchar2) return token_list pipelined parallel_enable (partition p_cursor by any) is
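    -- parallel-enabled variant: tokenizes every row fetched from P_CURSOR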
    l_st       clob_tokenizer;
    l_string   clob;
    l_startpos number;
    l_token    varchar2(4000);
  begin
    loop
      fetch p_cursor into l_string;  
      exit when p_cursor%notfound;
     
      l_st := clob_tokenizer(l_string, p_delim);
      while l_st.has_more_tokens = 1 loop
        l_startpos := l_st.parser_current_pos;
        l_token := l_st.next_token();
        pipe row (token_t(l_token, l_startpos, nvl(length(l_token),0)));
      end loop;
    end loop;
    return;
  end;

end;
/
sho err

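-- STRING_TOKENIZER mirrors CLOB_TOKENIZER, but works on VARCHAR2 values
-- (up to 4,000 bytes) using the built-in INSTR and SUBSTR functions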
create type string_tokenizer as object(
  value_string       varchar2(4000),
  delimiter          varchar2(10),
  parser_current_pos number,
  last_token         varchar2(4000),
  constructor function string_tokenizer (p_string in varchar2, p_delim in varchar2) return self as result,
  member function has_more_tokens(self in out nocopy string_tokenizer) return number,
  member function next_token(self in out nocopy string_tokenizer) return varchar2,
  static function all_tokens (p_string in varchar2, p_delim in varchar2) return token_list pipelined parallel_enable,
  static function all_tokens_cur (p_cursor in sys_refcursor, p_delim in varchar2) return token_list pipelined parallel_enable (partition p_cursor by any)
);
/
sho err

create or replace type body string_tokenizer is
  constructor function string_tokenizer (p_string in varchar2, p_delim in varchar2) return self as result as
  begin
    self.value_string := p_string;
    self.delimiter := p_delim;
    self.parser_current_pos := 1;
    self.last_token := null;
    return ;
  end;
    
  member function has_more_tokens(self in out nocopy string_tokenizer) return number as
  begin
    if self.parser_current_pos <= length(self.value_string) then 
      return 1;
    else 
      return 0;
    end if;
  end;

  member function next_token(self in out nocopy string_tokenizer) return varchar2 as
    l_next_delim_pos   number;
    l_next_enclose_pos number;
    l_token            varchar2(4000);
  begin
    if self.has_more_tokens() = 1 then 
      l_next_delim_pos := instr(self.value_string, self.delimiter, self.parser_current_pos);
      if l_next_delim_pos = 0 then
        l_token := substr(self.value_string, self.parser_current_pos);
        self.parser_current_pos := length(self.value_string) + 1; 
      else 
        l_token := substr(self.value_string, self.parser_current_pos, l_next_delim_pos  - self.parser_current_pos);
        self.parser_current_pos := l_next_delim_pos + length(self.delimiter);
      end if;
    else 
      l_token := null;
    end if;
    self.last_token := l_token;
    return l_token;
  end;

  static function all_tokens (p_string in varchar2, p_delim in varchar2) return token_list pipelined parallel_enable is
    l_st string_tokenizer := string_tokenizer(p_string, p_delim);
    l_startpos number;
    l_token    varchar2(4000);
  begin
    while l_st.has_more_tokens = 1 loop
      l_startpos := l_st.parser_current_pos;
      l_token := l_st.next_token();
      pipe row (token_t(l_token, l_startpos, nvl(length(l_token),0)));
    end loop;
    return;
  end;

  static function all_tokens_cur (p_cursor in sys_refcursor, p_delim in varchar2) return token_list pipelined parallel_enable (partition p_cursor by any) is
    l_st       string_tokenizer;
    l_string   varchar2(4000);
    l_startpos number;
    l_token    varchar2(4000);
  begin
    loop
      fetch p_cursor into l_string;  
      exit when p_cursor%notfound;
     
      l_st := string_tokenizer(l_string, p_delim);
      while l_st.has_more_tokens = 1 loop
        l_startpos := l_st.parser_current_pos;
        l_token := l_st.next_token();
        pipe row (token_t(l_token, l_startpos, nvl(length(l_token),0)));
      end loop;
    end loop;
    return;
  end;
end;
/
sho err
You can use the types in a procedural manner ...
create or replace procedure st_tester (p_string in varchar2, p_delim in varchar2) is
  v_st string_tokenizer := string_tokenizer(p_string, p_delim);
  v_cnt pls_integer := 0;
begin
  while v_st.has_more_tokens() = 1 loop
    dbms_output.put_line(v_cnt||': '||v_st.next_token());
    v_cnt := v_cnt + 1;
  end loop;
end;
/

exec st_tester('##a#b#c#d#e#f##','#');

0: ?
1: ?
2: a
3: b
4: c
5: d
6: e
7: f
8: ?
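The CLOB variant can be driven the same way; here is a minimal sketch with made-up example values (the semicolon-delimited literal is mine, not from the original test):

declare
  v_ct  clob_tokenizer := clob_tokenizer(to_clob('alpha;beta;gamma'), ';');
  v_cnt pls_integer := 0;
begin
  while v_ct.has_more_tokens() = 1 loop
    dbms_output.put_line(v_cnt||': '||v_ct.next_token());
    v_cnt := v_cnt + 1;
  end loop;
end;
/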
... as well as in a SQL query.
select rownum, token_text, start_pos, length from table(string_tokenizer.all_tokens('##a#b#c#d#e#f##', '#') );

    ROWNUM TOKEN_TEXT                      START_POS     LENGTH
---------- ------------------------------ ---------- ----------
         1 ?                                       1          0
         2 a                                       2          1
         3 b                                       4          1
         4 c                                       6          1
         5 d                                       8          1
         6 e                                      10          1
         7 f                                      12          1
         8 ?                                      14          0
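The static function ALL_TOKENS_CUR takes a ref cursor instead of a single string, so one call can tokenize many rows; a minimal sketch (the cursor has to select exactly one column of the matching type):

select token_text, start_pos, length
from table(string_tokenizer.all_tokens_cur(
       cursor(select '1,2,3' from dual
              union all
              select 'x,y' from dual),
       ','));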
Back to the initial problem: using these types I could "normalize" the table with the comma-separated column (TABELLE.MERKMALE VARCHAR2(400)) with a simple SQL statement.
create table m1 as 
select t.id, m.token_text from tabelle t, table(string_tokenizer.all_tokens(t.merkmale, ',')) m
/

Table created.

Elapsed: 00:00:05.14


select count(*) from m1;

  COUNT(*)
----------
    204006
For a laptop computer, the elapsed time does not look too bad ...
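Since ALL_TOKENS_CUR is declared PARALLEL_ENABLE (PARTITION P_CURSOR BY ANY), Oracle may also distribute the tokenizing across parallel query slaves when the driving cursor is parallel; a sketch against the same table (whether it really runs in parallel depends on the usual parallel query prerequisites):

select count(*)
from table(string_tokenizer.all_tokens_cur(
       cursor(select /*+ parallel(t, 4) */ t.merkmale from tabelle t),
       ','));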
