Tuesday, March 08, 2011

Using Oracle to convert PDF to HTML

I've used Oracle Text on a couple of assignments. One of the handy things it can do is convert a PDF into an HTML or plain text document. Actually, you can do this sort of thing through Google Docs too, by uploading a PDF and saving it back to your desktop as a different file type. But I'm going to demo the Oracle mechanism.
 
Firstly, you need an Oracle directory. You may already have one. If not, your DBA will need to create it as the CREATE ANY DIRECTORY privilege is needed and this is one of those very dangerous privileges that a cautious DBA will keep under lock and key. If, like me, you are using a personal XE environment, you can do what you want.
 
create or replace directory pdf_files 
  as 'WHATEVER YOUR SOURCE DIR IS';
grant all on directory pdf_files to public;
 
The next step is a table to store the PDF in (as a BLOB) and the appropriate Oracle Text index.
 
create table ctx_demo
  (filename varchar2(30),
   text blob)
/
create index search_idx
  on ctx_demo (text)
  indextype is ctxsys.context
  parameters ('sync (on commit)')
/
 
The third step is a few simple procedures. One to store the file in the blob, the second to get the converted HTML as a CLOB, and the final procedure to write the CLOB as a file. In practice, they'd live happily together in a package.
 
create or replace procedure load_file
  (i_filename in varchar, 
   i_dirname in varchar2 default 'PDF_FILES')
is
  l_blob blob;
  l_bfile bfile;
begin
  insert into ctx_demo (filename, text)
    values ( i_filename, empty_blob() )
    returning text into l_blob;
  l_bfile := bfilename( i_dirname, i_filename );
  dbms_lob.fileopen( l_bfile );
  dbms_lob.loadfromfile
    (l_blob, l_bfile, dbms_lob.getlength( l_bfile ) );
  dbms_lob.fileclose( l_bfile );
end;
/
 
create or replace function get_html
  (i_filename in varchar2) return clob
is
  v_rowid rowid;
  v_res clob;
begin
  select rowid into v_rowid
  from ctx_demo
  where filename = i_filename;
  --
  CTX_DOC.FILTER(index_name => 'SEARCH_IDX',
    textkey => v_rowid,
    restab => v_res,
    plaintext => false);
    return v_res;
end;
/
create or replace procedure write_html 
  (i_filename in varchar2)
is
  v_clob     clob;
  v_out_file UTL_FILE.file_type;
  v_buffer   VARCHAR2(16000);
  v_amount   BINARY_INTEGER := 8000;
  v_pos      INTEGER := 1;
  v_clob_len INTEGER;
begin
  v_clob := get_html(i_filename||'.pdf');
  v_clob_len := DBMS_LOB.getlength(v_clob);
  v_out_file := UTL_FILE.fopen('PDF_FILES',
                      i_filename||'.html', 'w', 32767);
  WHILE v_pos <> v_clob_len LOOP
    DBMS_LOB.read (v_clob, v_amount, v_pos, v_buffer);
    UTL_FILE.put(v_out_file, v_buffer);
    UTL_FILE.fflush(v_out_file);
    v_pos := v_pos + v_amount;
  END LOOP;
  UTL_FILE.fclose(v_out_file);
END;
/
 
Once all the preparation is done, it is a simple matter of loading the PDF and saving the html.

For this demo, I'm using an old NoCoug journal, as they recently loaded issues for the past ten years. If you are interested in the complete collection, I've got the links listed on my NoCoug Journal page. Using XE, this conversion only works up to the August 2006 issue. The later ones (except for February 2009) don't get any useful output. Possibly a later Oracle edition would cope better with these PDFs. I'm crossing my fingers that the Easter Bunny will bring me a chocolate covered XE 11g.


begin
  load_file('NoCOUG_Journal_200605.pdf');
  write_html('NoCOUG_Journal_200605');
  commit;
end;
/
 
Remember PDFs aren't designed to be reverse engineered to HTML documents, and Oracle Text is just interested in the text, so there aren't any pictures.

No comments: