So shall we start right off the bat?
In this article, we will develop a simple Lucene 4 application with Apache Maven 2.
In all search projects, the challenge is the text itself, because text is considered unstructured data. To tackle this problem, Analyzers try to give some structure to the text they need to search in. They achieve this through tokenizing (dividing text into units such as words, character sequences, phrases, email addresses, ...), stemming (reducing each word token to its root) and removing stop words (words that repeat often in the text and are therefore of less value) from the stream.
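As a toy illustration of these three steps in plain Java (this is not Lucene's implementation; the stop-word list and the single stemming rule here are made up for the example):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ToyAnalyzer {
    // A tiny stop-word set; real analyzers ship much longer lists.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "on", "of");

    // Tokenize on whitespace, lower-case, drop stop words, apply a toy stemming rule.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String token = raw.toLowerCase();
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // stop words carry little value, so they are removed
            }
            tokens.add(stem(token));
        }
        return tokens;
    }

    // Toy stemmer: strips a plural "s" only; real stemmers (e.g. Porter) are far richer.
    private static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }
}

For example, analyze("The Birds on a wire") yields [bird, wire]: the stop words are gone, the case is normalized and the plural is stemmed away.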
Lucene comes with a set of different Analyzers to be used in different situations. Here, we explain a few of them given the following example text: "Bird on a wire".
First: Theory

1. WhitespaceAnalyzer
The simplest Analyzer in the package: each token starts after a white space and ends with a white space, with no stemming or stop-word removal performed.
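A plain-Java sketch of this behavior (illustrative only, not the actual Lucene class):

import java.util.Arrays;
import java.util.List;

public class WhitespaceTokens {
    // Mimics WhitespaceAnalyzer: split on whitespace only; no lower-casing,
    // no stemming, no stop-word removal.
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }
}

For the example phrase "Bird on a wire" this yields [Bird, on, a, wire], with the original case preserved.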
2. SimpleAnalyzer

The Analyzer tokenizes the text into letters (with the isLetter method of the java.lang.Character class) and applies a lower-case filter to it. This Analyzer will have problems with far-eastern languages such as Chinese.
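This letter-based tokenization can be sketched in plain Java as follows (again an illustration of the behavior, not Lucene's code):

import java.util.ArrayList;
import java.util.List;

public class LetterTokens {
    // A token is a maximal run of letters (Character.isLetter); everything
    // else, including digits, acts as a separator. Tokens are lower-cased.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

Note how tokenize("Bird-42 on a wire") yields [bird, on, a, wire]: the hyphen and the digits simply vanish.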
3. StandardAnalyzer

The most popular Analyzer for English texts. It includes the following rules for creating tokens, which suit most European languages (StandardTokenizer):
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
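A rough plain-Java approximation of these three rules (Lucene's real StandardTokenizer is a generated state machine and covers many more cases; the lower-case filter of the full Analyzer is omitted here):

import java.util.ArrayList;
import java.util.List;

public class StandardRules {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            // A dot (or other punctuation) followed by whitespace/end is not
            // part of the token, so strip it from the end.
            String token = raw.replaceAll("[.,;:!?]+$", "");
            if (token.isEmpty()) {
                continue;
            }
            if (token.contains("@")) {
                tokens.add(token); // keep email addresses as one token
            } else if (token.contains("-") && !token.matches(".*\\d.*")) {
                // split at hyphens unless the token contains a number
                for (String part : token.split("-+")) {
                    if (!part.isEmpty()) tokens.add(part);
                }
            } else {
                tokens.add(token); // inner dots (hostnames) stay intact
            }
        }
        return tokens;
    }
}

For instance, tokenize("Mail john.doe@example.com about the Wi-Fi card XQ-72.") keeps the email address whole, splits "Wi-Fi" but leaves the product number "XQ-72" intact.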
The Analyzer is based on Martin Porter's stemming algorithm. In this approach, tokenization is combined with stemming, lower-case filtering and stop-word filtering. This approach is highly useful; however, it should be noted that it can create meaningless tokens, such as "tire" for the word "tired". This is not a problem, since Analyzers are used for matching documents and do not affect the response which the user reads afterwards.
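The "tired" → "tire" example follows from the "-ed" rule of Porter's algorithm; a toy fragment covering only that one rule (the full algorithm has many more steps) could look like this:

public class ToyPorter {
    private static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    // Consonant-vowel-consonant ending, final consonant not w, x or y:
    // Porter restores a trailing 'e' in this case ("tir" -> "tire").
    private static boolean endsCvc(String s) {
        if (s.length() < 3) return false;
        char c1 = s.charAt(s.length() - 3);
        char v = s.charAt(s.length() - 2);
        char c2 = s.charAt(s.length() - 1);
        return !isVowel(c1) && isVowel(v) && !isVowel(c2) && "wxy".indexOf(c2) < 0;
    }

    // Only the "-ed" step of Porter's algorithm, as a toy illustration.
    public static String stem(String word) {
        if (word.endsWith("ed")) {
            String stem = word.substring(0, word.length() - 2);
            if (stem.chars().anyMatch(c -> isVowel((char) c))) {
                return endsCvc(stem) ? stem + "e" : stem;
            }
        }
        return word;
    }
}

So stem("tired") gives "tire" and stem("jumped") gives "jump", matching what the real stemmer does for these two words.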
The term bigram refers to a subsequence of two elements of a string. This Analyzer creates a longer index than the other Analyzers. While it filters the text to lower case, in order to keep stop words in the index an underscore character "_" is placed between the first and second part of each token pair. The start and the end of a string are indexed twice in this approach. The following are the tokens for "Bird on a wire":
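The original token listing is missing here; a plausible sketch of the scheme as described (an assumption: lower-casing the words and joining each adjacent pair with an underscore) would be:

import java.util.ArrayList;
import java.util.List;

public class BigramTokens {
    // Joins each pair of adjacent words with "_", keeping stop words, so a
    // phrase such as "bird on" can later be matched exactly via "bird_on".
    public static List<String> bigrams(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            tokens.add(words[i] + "_" + words[i + 1]);
        }
        return tokens;
    }
}

Under this reading, bigrams("Bird on a wire") produces [bird_on, on_a, a_wire]; every interior word appears in two tokens, which is consistent with parts of the string being indexed twice.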
It should be noted that the idea of forming bigrams is based on finite-state machines. Lastly, remember that the big advantage of this approach is that it can match queries which include stop words more precisely than other Analyzers.
Since we are only starting, we will just set up pom.xml and build the application dependencies; we will continue in the next blog post.
First, creating the Maven archetype:
mvn archetype:generate -DgroupId=se.findwise -DartifactId=my-lucene-app -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
Second, editing pom.xml:
Have you noticed that in the pom.xml we also configure the exec-maven-plugin with JVM options (heap size and GC logging) for running the application?

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>se.findwise</groupId>
  <artifactId>my-lucene-app</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>my-lucene-app</name>
  <url>http://maven.apache.org</url>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.0.2</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <configuration>
          <executable>java</executable>
          <arguments>
            <argument>-Xms512m</argument>
            <argument>-Xmx512m</argument>
            <argument>-XX:NewRatio=3</argument>
            <argument>-XX:+PrintGCTimeStamps</argument>
            <argument>-XX:+PrintGCDetails</argument>
            <argument>-Xloggc:gc.log</argument>
            <argument>-classpath</argument>
            <classpath/>
            <argument>se.findwise.App</argument>
          </arguments>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>4.0.0-BETA</version>
    </dependency>
  </dependencies>
</project>
Now just build the project and we are finished for today! ;)