SQL Server vs Oracle: Semantic Search

By: | Updated: 2022-06-21 | Comments (1) | Related: More > Full Text Search



Problem

As a follow-up to SQL Server vs Oracle: Data Loading, we want to perform sentiment analysis, but before doing that it's important to understand the data. In the article Comparing SQL Server Full Text Search and Oracle Text Search we were able to search for common words, misspellings, and stemming (the root and derivatives of a word). Now it's time to let the database tell us automatically which common phrases it can identify and how often they repeat (not identically, but very similarly). How is this performed in SQL Server and in Oracle, and what are the differences?

Solution

SQL Server and Oracle Semantic Search allow you to perform the following natural language understanding and processing tasks: deriving user intent and acting when it matches a query, sentiment analysis against a predefined dictionary-based sentiment list, text mining, document ranking, key tag extraction, and related content discovery.

They also allow you to perform statistical semantic search (useful because a word can have multiple meanings, or its meaning changes with the context or the surrounding words): return a scored table of semantic key tags, return a scored table of documents similar to a specified one, and return the key tags that make two documents semantically similar.

Remember we're analyzing tweets: by extracting key tags you can see which tweets are the most influential over others, and this can also help us identify and remove bot/botnet/netcenter tweets that can skew perception.

Note: SQL Server Semantic Search only indexes single words, not multi-word phrases (N-grams), and inflectional forms and synonyms are indexed separately.
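To make the note above concrete, here is a minimal Python sketch (not SQL Server code; the tokenizer and sample text are illustrative assumptions) of the difference between single-word indexing and multi-word N-gram indexing:

```python
import re

def single_word_terms(text):
    # What SQL Server Semantic Search indexes: individual words only.
    return re.findall(r"[a-z']+", text.lower())

def ngram_terms(text, n=2):
    # Multi-word phrases (N-grams), which SQL Server does NOT index.
    words = single_word_terms(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

tweet = "Full text search finds phrases"
print(single_word_terms(tweet))  # single words, as indexed
print(ngram_terms(tweet))        # two-word phrases, not indexed
```

So a phrase like "full text" never exists as one key phrase in the semantic index; only "full" and "text" do, each scored separately.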

In the next sections you will see how this is enabled in SQL Server and in Oracle, and the differences between them.

Oracle

To install Oracle Semantic Search, you must execute the following steps:

Step 1 - If your database is new, you need to increase JAVA_POOL_SIZE from the default 4 MB to 256 MB, because Semantic Search uses Java classes. And because it uses cursors, stored procedures, control structures, and parallel execution, if you have not set SGA_TARGET you also need to increase SHARED_POOL_SIZE from the default 172 MB to 512 MB. This is done with the PowerShell commands below; after that, bounce the instance with a "shutdown immediate" followed by a "startup". In the end, memory usage will increase by about 600 MB:

 Clear-Content C:\app\Oracle19c\database\INITORCL.ORA
 $content=@"
 db_name=ORCL
 java_pool_size=256M
 shared_pool_size=512M
 "@
 Add-Content C:\app\Oracle19c\database\INITORCL.ORA $content
 
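As a sanity check on the "about 600 MB" figure, the increase is just the sum of the two pool deltas stated above; a quick Python calculation:

```python
# Deltas from the defaults mentioned above (values in MB).
java_pool_increase = 256 - 4      # JAVA_POOL_SIZE: 4 MB -> 256 MB
shared_pool_increase = 512 - 172  # SHARED_POOL_SIZE: 172 MB -> 512 MB
total_increase = java_pool_increase + shared_pool_increase
print(total_increase)  # 592, i.e. roughly 600 MB
```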

Step 2 - If your database is new, you need to increase the datafiles for SYSTEM/SYSAUX/SYS_UNDOTS, because several objects will be created; this prevents slowness and constant file-increase operations. This is done with the SQL commands below, connected as SYS:

 ALTER DATABASE DATAFILE 1 RESIZE 800M;
 ALTER DATABASE DATAFILE 2 RESIZE 300M;
 ALTER DATABASE DATAFILE 3 RESIZE 500M;
 

Step 3 - Verify there are no invalid installed components; if there are, you need to reinstall them by running the appropriate script. This is verified by running the SQL command below connected as SYS and confirming the output as shown; run it every time you install a new component:

 SET LINESIZE 300
 COL COMP_ID FOR A7
 COL COMP_NAME FOR A40
 COL VERSION FOR A10
 SELECT COMP_ID, COMP_NAME, VERSION, STATUS FROM DBA_REGISTRY;
 
 COMP_ID COMP_NAME VERSION STATUS
 ------- ---------------------------------------- ---------- -----------
 CATALOG Oracle Database Catalog Views 19.0.0.0.0 VALID
 CATPROC Oracle Database Packages and Types 19.0.0.0.0 LOADED
 RAC Oracle Real Application Clusters 19.0.0.0.0 OPTION OFF
 JAVAVM JServer JAVA Virtual Machine 19.0.0.0.0 VALID
 XDB Oracle XML Database 19.0.0.0.0 VALID
 

Step 4 - Verify there are no invalid objects; if there are, change directory with "cd C:\app\Oracle19c" and run the script connected as SYS with "@?\rdbms\admin\utlrp.sql". This is verified by running the SQL command below connected as SYS and confirming the output as shown; run it every time you install a new component:

 COL OWNER FOR A20
 COL OBJECT_TYPE FOR A15
 COL OBJECT_NAME FOR A30
 SET PAGESIZE 500
 SELECT OWNER, OBJECT_TYPE, OBJECT_NAME FROM DBA_OBJECTS WHERE STATUS <> 'VALID' ORDER BY OWNER, OBJECT_TYPE;
 
 no rows selected
 

Step 5 - Install the Java Virtual Machine (JVM) component with the SQL commands below (it runs faster if you bounce the database before running it); otherwise you can't run SPARQL queries:

 spool C:\setup\initjvm.out
 @?\javavm\install\initjvm.sql
 spool off
 

Once run, confirm there were no errors with the PowerShell command below; if there were errors you need to resolve them and reinstall the component:

 Get-Content "C:\setup\initjvm.out" | Select-String "(ORA-)|(ERR)" | Select LineNumber, Line | Format-Table
 

There will be a new component installed as follows:

 COMP_ID COMP_NAME VERSION STATUS
 ------- ---------------------------------------- ---------- -----------
 JAVAVM JServer JAVA Virtual Machine 19.0.0.0.0 VALID
 

You also need to verify the Java option is enabled with the SQL command and output below:

 SELECT * FROM V$OPTION WHERE PARAMETER='Java';
 
 PARAMETER VALUE CON_ID
 ---------- ---------- ----------
 Java TRUE 0
 

And verify it works with the SQL commands and output below:

 SELECT DBMS_JAVA.LONGNAME('TEST') FROM DUAL;
 
 DBMS_JAVA.LONGNAME('TEST')
 --------------------------
 TEST
 

Step 6 - Even if Oracle XML Database (XDB) is installed, you need to install the XML XDK component with the SQL commands below (it runs faster if you bounce the database before running it); otherwise the ORA-29549 error is returned:

 spool C:\setup\xdk.out
 @?\xdk\admin\initxml.sql
 spool off
 

Once run, confirm there were no errors in the output file. There will be a new component installed as follows:

 COMP_ID COMP_NAME VERSION STATUS
 ------- ---------------------------------------- ---------- -----------
 XML Oracle XDK 19.0.0.0.0 VALID
 
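The spool file can be scanned for problems the same way as the JVM install, looking for ORA- or ERR lines. Below is a small Python sketch of that scan, equivalent to the earlier Select-String command (the sample lines are made up for illustration; a real run would read the spool file such as C:\setup\xdk.out):

```python
import re

def scan_spool(lines):
    # Return (line_number, line) pairs containing ORA- or ERR,
    # mirroring the Select-String pattern "(ORA-)|(ERR)".
    pattern = re.compile(r"(ORA-)|(ERR)")
    return [(n, line) for n, line in enumerate(lines, start=1)
            if pattern.search(line)]

# Illustrative spool content only.
sample = [
    "Session altered.",
    "ORA-29549: class has changed",  # hypothetical error line
    "PL/SQL procedure successfully completed.",
]
print(scan_spool(sample))
```

An empty result means the component installed cleanly; any hits need to be resolved and the component reinstalled.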

Step 7 - Install Spatial and Graph (SDO) with the SQL commands below (it runs faster if you bounce the database before running it):

 spool C:\setup\mdinst.out
 @?\md\admin\mdinst.sql
 spool off
 

Once run, confirm there were no errors in the output file. In my case, the following 4 files contained invalid characters and returned ORA-29913 and ORA-30653, so I had to copy them to my host machine, open them in Notepad++, make a change (like adding a white space and deleting it), save the files, and copy them back to the original location in the Docker container. These files are used in external table imports:

 C:\app\Oracle19c\md\admin\sdo_coord_ref_sys.txt
 C:\app\Oracle19c\md\admin\sdo_coord_ops.txt
 C:\app\Oracle19c\md\admin\sdo_coord_op_param_vals.txt
 C:\app\Oracle19c\md\admin\cs_srs.txt
 

There will be a new component installed as follows:

 COMP_ID COMP_NAME VERSION STATUS
 ------- ---------------------------------------- ---------- ------------
 SDO Spatial 19.0.0.0.0 VALID
 

And you can verify the installed SDO components with the SQL commands and output below:

 COL NAMESPACE FOR A10
 COL ATTRIBUTE FOR A11
 COL VALUE FOR A10
 COL DESCRIPTION FOR A40
 SELECT NAMESPACE, ATTRIBUTE, VALUE, DESCRIPTION FROM MDSYS.RDF_PARAMETER;
 
 NAMESPACE ATTRIBUTE VALUE DESCRIPTION
 ---------- ----------- ---------- ----------------------------------------
 COMPONENT RDFCTX INSTALLED Semantic (Text) Search component
 COMPONENT RDFOLS INSTALLED RDF Optional component for OLS support
 MDSYS SEM_VERSION 19.1.0.0.0 VALID
 

Test Oracle Semantic Search

Once Semantic Search is installed, you can perform a test with the steps below. Part of the instructions were taken from the Oracle document "How To Use The GATE Extractor To Create SemContext Indexes (Doc ID 1550829.1)", which is outdated:

Step 1 - On your host machine, install General Architecture for Text Engineering (GATE), an open-source natural language processor and information extractor. It can be downloaded from here; the installer is named gate-developer-9.0.1-installer.exe with size 57.9 MB. Note two folders will be created: "C:\Program Files (x86)\GATE_Developer_9.0.1" and "C:\Users\pabechevb\.m2\repository\uk\ac\gate\plugins\annie\9.1". Once installed, open the newly installed application "GATE 9.0.1", click on File > Manage CREOLE Plugins, select ANNIE (9.1), click on the "Extract Plugin Resources" button, and select the C:\temp\ folder. This creates a file named "ANNIE_with_defaults.gapp", which is used when launching the GateListener described later.

Step 2 - Create a listener that will receive requests from the database and reply with the GATE response. The sample can be downloaded from here with filename gatelistener.zip and size 3.18 KB. However, this is for version 5 and won't work with version 9; instead, use the attached file named GateListener.java, which I modified to make it work with version 9. To compile it, I placed it at C:\temp.

First you need to compile the file with the CMD commands below. Note it requires the JDK for the compiler, so in my case I'm using the executable from the Oracle installer; also note you need to reference the GATE jar file:

 cmd
 cd C:\temp
 "C:\temp\WINDOWS.X64_193000_db_home\jdk\bin\javac.exe" -classpath "C:\Program Files (x86)\GATE_Developer_9.0.1\lib\gate-core-9.0.1.jar" GateListener.java
 

This compiles the class in GateListener.java and outputs a file named GateListener.class. Then you can start the listener with the CMD commands below; note I have the JRE installed locally, so this is the executable I'm using, and Windows Firewall will ask you to open the port, which I left as 12000:

 cmd
 cd C:\temp
 mkdir ANNIE
 copy ANNIE_with_defaults.gapp .\ANNIE
 "C:\Program Files (x86)\Java\jre1.8.0_333\bin\java.exe" -classpath .;"C:\Program Files (x86)\GATE_Developer_9.0.1\bin\*";"C:\Program Files (x86)\GATE_Developer_9.0.1\lib\*" GateListener 12000
 

Once the listener is running, you can test it is reachable from the Docker container with the PowerShell command below, replacing the IP with your host computer's IP. Note this will crash the program, and you will need to restart the listener:

 Test-NetConnection -ComputerName 192.168.0.4 -InformationLevel "Detailed" -Port 12000
 
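The reason a bare port probe crashes the listener is that it expects a specific request shape. Based on the GateListener.java sample posted in the comments, the client sends the document length on the first line followed by the document text, and the listener replies with a control line (the content length, or -1 on error) followed by the annotation XML. The Python sketch below is an assumption derived from reading that code, not an official protocol; the host/port values are the ones used in this article:

```python
import socket

def build_request(document: str) -> bytes:
    # First line: document length; then the document itself,
    # matching getDocumentStreamContent() in GateListener.java.
    return f"{len(document)}\n{document}".encode("utf-8")

def send_to_gate(host: str, port: int, document: str) -> str:
    # Hypothetical client; requires the listener from Step 2 to be running.
    with socket.create_connection((host, port)) as conn:
        conn.sendall(build_request(document))
        reply = conn.makefile("r")
        control = int(reply.readline())  # -1 means the listener reported an error
        return reply.read()              # annotation-set XML (or error text)

print(build_request("Barack Obama visited Paris."))
```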

Step 3 - Connect to the database with the SYSTEM user (SYS doesn't work because it has too many privileges, MDSYS is configured to not allow connecting to it, and a normal user lacks several required privileges) and configure the GATE host and port with the SQL commands below, replacing the IP with your host machine's IP. Note this is done once per database; if it's not configured, you'll get the ORA-13199 error later:

 begin
 sem_rdfctx.set_extractor_param (
 param_key => 'GATE_NLP_HOST',
 param_value => '192.168.0.4',
 param_desc => 'Host for GATE NLP Listener');
 sem_rdfctx.set_extractor_param (
 param_key => 'GATE_NLP_PORT',
 param_value => '12000',
 param_desc => 'Port for Gate NLP Listener');
 end;
 /
 

Step 4 - Grant access to the SYSTEM user to connect to external hosts with the SQL commands below, otherwise you'll get the ORA-24247 error later:

 begin
 dbms_network_acl_admin.create_acl (
 acl => 'SEMINDEX',
 description => 'Allow query SPARQL endpoints',
 principal => 'SYSTEM',
 is_grant => true,
 privilege => 'connect');
 dbms_network_acl_admin.assign_acl (
 acl => 'SEMINDEX',
 host => '*');
 end;
 /
 

You can verify it is created with the SQL commands below:

 COL PRINCIPAL FOR A10
 COL HOST FOR A12
 COL LOWER_PORT FOR 99999
 COL UPPER_PORT FOR 99999
 COL ACL FOR A30
 COL PRIVILEGE FOR A9
 COL PRIVILEGE_STATUS FOR A16
 SET LINESIZE 300
 SET PAGESIZE 500
 SELECT PRINCIPAL, HOST, LOWER_PORT, UPPER_PORT, ACL, 'CONNECT' AS PRIVILEGE, 
 DECODE(DBMS_NETWORK_ACL_ADMIN.CHECK_PRIVILEGE_ACLID(aclid, PRINCIPAL, 'connect'), 1,'GRANTED', 0,'DENIED', NULL) PRIVILEGE_STATUS
 FROM DBA_NETWORK_ACLS
 JOIN DBA_NETWORK_ACL_PRIVILEGES USING (ACL, ACLID) 
 UNION ALL
 SELECT PRINCIPAL, HOST, NULL lower_port, NULL upper_port, acl, 'resolve' AS PRIVILEGE, 
 DECODE(DBMS_NETWORK_ACL_ADMIN.CHECK_PRIVILEGE_ACLID(aclid, PRINCIPAL, 'resolve'), 1,'GRANTED', 0,'DENIED', NULL) PRIVILEGE_STATUS
 FROM DBA_NETWORK_ACLS
 JOIN DBA_NETWORK_ACL_PRIVILEGES USING (ACL, ACLID);
 

And you can test with the SQL commands below (remember to restart the listener after it crashes):

 DECLARE
 l_conn Sys.UTL_TCP.connection;
 BEGIN
 l_conn := utl_tcp.open_connection(remote_host =>'192.168.0.4',remote_port => 12000);
 utl_tcp.close_connection(l_conn);
 END;
 /
 

Step 5 - Create a semantic network with the SQL command below, specifying the tablespace where it will be created:

 EXECUTE SEM_APIS.CREATE_SEM_NETWORK(TABLESPACE_NAME => 'SYSTEM');
 

Then create a Semantic Text policy that uses the GATE extractor with the SQL commands below; the extractor can be tuned and extended with additional ontologies, but this is the simplest form:

 begin
 sem_rdfctx.create_policy (policy_name => 'SEM_EXTR',
 extractor => mdsys.gatenlp_extractor());
 end;
 /
 

Step 6 - After the data has been imported as shown in Comparing SQL Server Full Text Search and Oracle Text Search, create the semantic index with the SQL command below; you will notice activity in the listener:

 CREATE INDEX RDFINDEX ON MYDB.MYTBL(TEXT) INDEXTYPE IS MDSYS.SemContext PARAMETERS('SEM_EXTR');
 

Verify there were no errors during the index creation with the SQL command below:

 select count(*) from mdsys.rdfctx_index_exceptions;
 

Now you need the name of the semantic model that was created, retrieved with the SQL query below and its output as shown:

 SELECT owner, model_id, model_name, table_name, column_name
 FROM MDSYS.SEM_MODEL$
 WHERE model_name like 'RDFCTX%';
 
 OWNER MODEL_ID MODEL_NAME TABLE_NAME COLUMN_NAM
 ---------- ---------- --------------- ---------- ----------
 SYSTEM 1 RDFCTX_MOD_1
 

And you can start running SPARQL queries, replacing the model name with the one returned by the previous query, to get every relation created:

 SELECT s, p, o
 FROM TABLE(SEM_MATCH(
 '(?s ?p ?o)',
 SEM_Models('RDFCTX_MOD_1'),
 null,
 null,
 null));
 

And you can query which records match any of the relations with the SPARQL query below:

 SELECT TEXT FROM MYDB.MYTBL WHERE SEM_CONTAINS(TEXT, 'SELECT ?s ?p ?o WHERE {?s ?p ?o}', 1) = 1;
 

There are additional operators, like SEM_CONTAINS_SELECT, which returns additional information about each document, and SEM_CONTAINS_COUNT, which counts the matching subgraphs for the specified pattern. You can also add annotations to the documents using the SEM_RDFCTX.MAINTAIN_TRIPLES procedure. And Oracle can extract a plain-text version of formatted documents like Word, RTF, and PDF using filters and CTX_DOC.POLICY_FILTER, which requires Oracle Text to be installed.

The official documentation can be found in this link, this link and this link.

SQL Server

In the article Comparing SQL Server Full Text Search and Oracle Text Search you learned how to install Full Text Search, which is a prerequisite for Semantic Search and allows you to classify and compare documents no matter their type. To install semantic search, you must execute the following additional steps:

Step 1 - Find the SemanticLanguageDatabase in the SQL Server installation media with the command and output below:

 Get-ChildItem -Path C:\setup -Filter SemanticLanguageDatabase.msi -Recurse -ErrorAction SilentlyContinue -Force
 
 Directory: C:\setup\1033_ENU_LP\x64\Setup
 Mode LastWriteTime Length Name
 ---- ------------- ------ ----
 -a---- 9/24/2019 10:53 PM 185946112 SemanticLanguageDatabase.msi
 
 Directory: C:\setup\x64\Setup
 Mode LastWriteTime Length Name
 ---- ------------- ------ ----
 -a---- 9/24/2019 11:33 PM 185946112 SemanticLanguageDatabase.msi
 

Step 2 - Run the installer, which will decompress the database files, with the command below; there is no output:

 & C:\setup\x64\Setup\SemanticLanguageDatabase.msi
 

Step 3 - Search for the decompressed database files with the command and output below:

 Get-ChildItem -Path C:\ -Filter semanticsdb.mdf -Recurse -ErrorAction SilentlyContinue -Force
 
 Directory: C:\Program Files\Microsoft Semantic Language Database
 Mode LastWriteTime Length Name
 ---- ------------- ------ ----
 -a---- 9/24/2019 2:21 PM 226426880 semanticsDB.mdf
 

Step 4 - Attach the database to the server instance with the command and output below:

 CREATE DATABASE semanticsdb
 ON ( FILENAME ='C:\Program Files\Microsoft Semantic Language Database\semanticsdb.mdf' )
 LOG ON ( FILENAME ='C:\Program Files\Microsoft Semantic Language Database\semanticsdb_log.ldf' )
 FOR ATTACH;
 GO
 
 Converting database 'semanticsdb' from version 855 to the current version 904.
 Database 'semanticsdb' running the upgrade step from version 855 to version 856.
 
 Database 'semanticsdb' running the upgrade step from version 903 to version 904.
 

Step 5 - Register the semantic statistics database with the command below, there is no output:

 EXEC sp_fulltext_semantic_register_language_statistics_db @dbname = N'semanticsdb';
 

Step 6 - Verify the semantic database is registered and check its version with the command and output below:

 SELECT * FROM sys.fulltext_semantic_language_statistics_database;
 GO
 
 database_id register_date registered_by version
 ----------- ----------------------- ------------- ---------------
 6 2022-04-08 13:47:43.470 1 11.0.1153.1.3
 

Step 7 - Verify the installed document parsers with the command below:

 SELECT * FROM sys.fulltext_document_types;
 

Step 8 - By default, Semantic Search doesn't have the Office 2007-2010 parsers installed; you need to download them. The file is named "FilterPack64bit.exe" with size 3.87 MB and version 2010. Install it with the commands below; there is no output, and the instructions are described in this link:

 cmd
 c:\setup\FilterPack64bit.exe
 exit
 

Step 9 - You also need to install the PDF parser, which you need to download. The file is named "PDFFilter64Setup.msi" with size 19.6 MB and version 11.0.1.36. Install it with the command below; there is no output:

 & c:\setup\PDFFilter64Setup.msi
 

Step 10 - Once the new filters are installed, you need to load them into Semantic Search with the commands below, there is no output:

 EXEC sp_fulltext_service 'update_languages';
 GO
 EXEC sp_fulltext_service 'load_os_resources', 1;
 GO
 EXEC sp_fulltext_service 'restart_all_fdhosts';
 GO
 

Then you can query the installed document parsers again, as before, and see their location and version number.

Now you can alter the existing full-text index, but in my case I'm going to drop and recreate it with the commands below; note the only difference from a full-text index is "STATISTICAL_SEMANTICS":

 USE MyDb
 GO
 DROP FULLTEXT INDEX ON [dbo].[MyTbl];
 GO
 CREATE FULLTEXT INDEX ON dbo.MyTbl(Text STATISTICAL_SEMANTICS) KEY INDEX ID_TWEET ON TWEETS WITH CHANGE_TRACKING AUTO;
 

Test SQL Server Semantic Search

You can find the key phrases in all tweets with the query below. They are called phrases, but in reality they are single words. It is ordered by score, so the most important words appear at the top:

 SELECT column_id, document_key, keyphrase, score
 FROM SEMANTICKEYPHRASETABLE(MyTbl, *)
 ORDER BY score DESC;
 

You can find the most important key phrases in all tweets with the query below. It is ordered by count, so the most common words appear at the top:

 SELECT keyphrase, COUNT(1)
 FROM SEMANTICKEYPHRASETABLE(MyTbl, *)
 GROUP BY keyphrase
 ORDER BY 2 DESC;
 

You can find similar or related tweets with the query below. Because this is done per document ID, we need to store intermediate results; this query may not work with large datasets, so you will need to run it in batches:

 CREATE TABLE #Info ([SourceDocumentKey] INT, [SourceColumnId] INT, [MatchedColumnId] INT, [MatchedDocumentKey] INT, [Score] REAL)
 DECLARE @cmd VARCHAR(MAX)
 SET @cmd =''
 SELECT @cmd = @cmd + 'INSERT #Info EXEC(''
 SELECT '+CAST([ID] AS VARCHAR(10))+', * FROM SEMANTICSIMILARITYTABLE(MyTbl, *, '+CAST([ID] AS VARCHAR(10))+')'');'
 FROM [MyTbl];
 EXEC (@cmd)
 SELECT s.Text, m.Text, Score
 FROM #Info [i]
 INNER JOIN MyTbl s on s.ID = i.SourceDocumentKey
 INNER JOIN MyTbl m on m.ID = i.MatchedDocumentKey
 WHERE Score > 0.5
 ORDER BY Score DESC;
 DROP TABLE #Info
 
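One way to split the work into batches is to build the dynamic SQL from the client side in chunks of IDs rather than one giant string. The Python sketch below is a hypothetical helper (table and column names follow the article's examples; the generated commands mirror the T-SQL loop above and would be executed through your database driver):

```python
def batches(ids, size):
    # Split the document IDs into chunks so each dynamic SQL batch stays small.
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def batch_sql(id_batch):
    # Build one INSERT-per-ID command string, mirroring the T-SQL loop above.
    parts = []
    for doc_id in id_batch:
        parts.append(
            "INSERT #Info EXEC('SELECT {0}, * FROM "
            "SEMANTICSIMILARITYTABLE(MyTbl, *, {0})');".format(doc_id)
        )
    return "\n".join(parts)

ids = [39, 56, 70, 81, 95]   # sample document IDs for illustration
for chunk in batches(ids, 2):
    print(batch_sql(chunk))  # each chunk would be sent as one EXEC batch
```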

In my case there were no results, because I'm selecting tweets that are at least 50% similar and there were none. This tells me that in this dataset there are no tweets influencing others, and they're not generated by bots/botnets/netcenters.

You can find the key phrases that make two documents similar or related (meaning their words repeat often) with the query below; MyTbl is the loaded table, Text is the column for both comparisons, and 39 and 56 are the document IDs in the database:

 SELECT keyphrase, score FROM SEMANTICSIMILARITYDETAILSTABLE(MyTbl, Text, 39, Text, 56);
 

The official documentation can be found here: Semantic search (contains several subtopics), Semantic search DDL, functions, stored procedures and views, Exploring Semantic Search Key Term Relevance.

Conclusion

You can see that installing Semantic Search is a complex topic in both Oracle and SQL Server. In Oracle it doesn't depend on Oracle Text, but having both installed allows you to perform additional actions, and SPARQL adds another level of difficulty. In SQL Server the queries remain T-SQL and are easy to write, but the statistics database and the Office extractors are a little out of date, so you need to test them first.

About the author
Pablo Echeverria worked for more than 10 years as a software programmer and analyst. In 2016, he switched to a DBA position, where he has implemented new processes, created better monitoring tools, and grown his data science skills.




Comments For This Article




Thursday, June 23, 2022 - 2:48:29 PM - Pablo Echeverria
Below is the code for GateListener.java:

/**
* The GateListener class is a sample listener implementation for the GATE
* engine that interacts with the Semantic indexing component in Oracle
* Database. The sample code uses the ANNIE extraction system and is tested
* with GATE Release 5.0 and JDK 1.5. The code may be modified to use custom
* gate plugins for domain-specific information extraction. This file
* must be compiled with a classpath that includes the jar files from the
* bin and lib directories of a GATE installation.
*
* See Oracle Database Semantic Technologies Developer's Guide for
* details about semantic indexing for documents.
*/

import gate.Annotation;
import gate.AnnotationSet;
import gate.Corpus;
import gate.Document;
import gate.DocumentContent;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;
import gate.corpora.DocumentContentImpl;
import gate.corpora.DocumentXmlUtils;
import gate.creole.ANNIEConstants;
import gate.creole.ExecutionException;
import gate.creole.ResourceInstantiationException;
import gate.creole.SerialAnalyserController;
import gate.util.GateException;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import java.io.File;
import gate.util.persistence.PersistenceManager;
import gate.creole.ConditionalSerialAnalyserController;

public class GateListener implements Runnable
{
private ConditionalSerialAnalyserController annieController = null;
private ServerSocket serverSocket = null;
private Thread sockThread = null;
private boolean reqEndThread = false;
private static final int READ_CHUNK_SIZE = 16*1024;

/**
* GateListener constructor.
* @param port - port at which the Gate listener is initialized
* @throws GateException
*/
public GateListener(int port)
throws GateException, IOException
{
Gate.init();

// create a serial analyser controller to run ANNIE with
/*annieController = (SerialAnalyserController) Factory.createResource(
"gate.creole.SerialAnalyserController", Factory.newFeatureMap(),
Factory.newFeatureMap(), "ANNIE_" + Gate.genSym());

// load each PR as defined in ANNIEConstants
for(int i = 0; i < ANNIEConstants.PR_NAMES.length; i++) {
FeatureMap params = Factory.newFeatureMap(); // use default parameters
System.out.println("Annie constant:"+ANNIEConstants.PR_NAMES[i]);
System.out.println(params.toString());
ProcessingResource pr = (ProcessingResource)
Factory.createResource(ANNIEConstants.PR_NAMES[i], params);

// add the PR to the pipeline controller
annieController.add(pr);
} // for each ANNIE PR*/

annieController = (ConditionalSerialAnalyserController)
PersistenceManager.loadObjectFromFile(new File(new File(
Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR),
ANNIEConstants.DEFAULT_FILE));

// start the listener
try {
serverSocket = new ServerSocket(port);
}
catch (IOException e)
{
System.err.println("Could not listen on port "+port+": "+e.getMessage());
System.exit(1);
}

// start the thread that handles extraction requests
sockThread = new Thread(this);
sockThread.start();
System.out.println("Started the Listener on port "+port);
}

/**
* To build document content over a stream.
* @param inStream - input stream. The input stream has the length of the
* document followed by the content of the document.
* @return
* @throws IOException
*/
private DocumentContentImpl getDocumentStreamContent(InputStream inStream)
throws IOException
{
int readLength = 0;
int bufLength = 0;
String sdocLength;
int docLength;
char[] readBuffer = new char[READ_CHUNK_SIZE];

BufferedReader uReader = null;
StringBuffer buf = new StringBuffer();

uReader = new BufferedReader(new InputStreamReader(inStream));

// First field is the length of the doc to be read
sdocLength = uReader.readLine();

if (sdocLength == null)
{
return new DocumentContentImpl(buf.toString());
}
try {
docLength = Integer.parseInt(sdocLength);
} catch (NumberFormatException nfe)
{
throw new RuntimeException("invalid document format");
}
readLength = 0;
while (readLength < docLength)
{
bufLength = uReader.read(readBuffer, 0, READ_CHUNK_SIZE);
if (bufLength == -1) break;
buf.append(readBuffer, 0, bufLength);
readLength = readLength+bufLength;
}

return new DocumentContentImpl(buf.toString());
}

/**
* Write to the output stream.
* @param out - output stream for the write operation
* @param control - control information (-1 if error; else actual length of content)
* @param content - content to be written to the stream
*/
private void writeToStream(PrintWriter out, int control, String content)
{
out.println(control);
out.print(content);
}

/**
* Listener logic for accepting documents over a socket connection,
* processing them, and writing the extracted information back to the
* socket.
*/
public void run()
{
while (!reqEndThread && serverSocket.isBound())
{
try {
Socket client = serverSocket.accept();
DocumentContentImpl docContent;
PrintWriter out = new PrintWriter(client.getOutputStream(), true);
Corpus corpus = (Corpus) Factory.createResource("gate.corpora.CorpusImpl");
Document doc = (Document)Factory.createResource("gate.corpora.DocumentImpl");

// after accepting a socket connection from the database, start reading
// the document.
try {
docContent = getDocumentStreamContent(client.getInputStream());
} catch (Exception genExcep)
{
writeToStream(out, -1, "Error reading document "+genExcep.getMessage());
client.close();
continue; // start accepting new connections.
}

doc.setContent(docContent);
corpus.add(doc);

annieController.setCorpus(corpus);
annieController.execute();

// get the annotation set for the document
AnnotationSet annotSet = doc.getAnnotations();
HashSet<String> annotTypesRequired = new HashSet<String>();
annotTypesRequired.add("Person");
annotTypesRequired.add("Location");
annotTypesRequired.add("FirstPerson");
annotTypesRequired.add("Date");
annotTypesRequired.add("Organization");
annotTypesRequired.add("Lookup");
AnnotationSet reqAnnots = annotSet.get(annotTypesRequired);
String content = docContent.toString();
Iterator iter = reqAnnots.iterator();

// For each annotation, add a feature map that includes the
// value that led to the annotation. This way the XML representation
// of the annotation set is complete.
while (iter.hasNext())
{
Annotation currAnnot = (Annotation)iter.next();
FeatureMap featureMap = currAnnot.getFeatures();
int stpos = (int)currAnnot.getStartNode().getOffset().longValue();
int edpos = (int)currAnnot.getEndNode().getOffset().longValue();
String entity = content.substring(stpos, edpos);
featureMap.put("entityValue", entity);
}

// generate the XML document with all annotations
StringBuffer extrInfo = new StringBuffer();
DocumentXmlUtils.annotationSetToXml(reqAnnots, extrInfo);
// write the XML document to the output stream
writeToStream(out, extrInfo.length(), extrInfo.toString());

out.close();
doc.cleanup();
corpus.cleanup();
client.close();
} catch (IOException ioe)
{
reqEndThread = true;
System.err.println("FATAL ERROR : "+ioe.getMessage());
} catch (ResourceInstantiationException rie)
{
reqEndThread = true;
System.err.println("FATAL ERROR : "+rie.getMessage());
} catch (ExecutionException execExcep)
{
reqEndThread = true;
System.err.println("FATAL ERROR : "+execExcep.getMessage());
}
}
}

public static void main(String[] args)
throws GateException, IOException
{
int port = 0;
if (args.length != 1)
{
System.err.println("Usage : GateListener [port-number]");
System.exit(1);
}

try
{
port = Integer.parseInt(args[0]);
if (port < 0)
throw new NumberFormatException("negative number");
}
catch (NumberFormatException nfe)
{
System.err.println("Invalid port number");
System.exit(1);
}

new GateListener(port);
}
}