“If you can't explain it to a six year old, you don't
understand it yourself.”
― Albert Einstein
Wednesday, September 5, 2012
Getting Started with Apache Lucene 4 with Maven: Hello World (Part 1: Analyzers)
Apache Lucene is a powerful search-engine library and a perfect tool whenever you need to search unstructured data or text inside your application. The Lucene website has a nice beginner tutorial here; however, it is short on details, and beginners may get lost among terms that the tutorial uses without explaining them.
So shall we start right off the bat?
In this article, we will develop a simple Lucene 4 application with Apache Maven 2.
In all search projects, the challenge is the text itself, because text is considered unstructured data. To tackle this problem, Analyzers impose some structure on the text they need to search. They achieve this through tokenizing (dividing text into units such as words, character sequences, phrases, or email addresses), stemming (reducing each word token to its root form), and removing stop words (words that repeat so often in the text that they carry little value) from the stream.
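The three steps above can be sketched in plain Java. This is only an illustrative toy, not Lucene API: the class name, the tiny stop-word list, and the regex-based "stemming" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch of the three analysis steps: tokenize, drop stop words,
// and (very crudely) stem. Real Lucene analyzers do this with
// TokenStream filter chains.
public class ToyAnalyzer {
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "on", "of"));

    public static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {   // tokenize
            if (token.isEmpty() || STOP_WORDS.contains(token))    // stop words
                continue;
            out.add(token.replaceAll("(ing|ed|s)$", ""));         // naive "stemming"
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [bird, land, wire]
        System.out.println(analyze("The birds landed on a wire"));
    }
}
```

Each step here is one line, while in Lucene each step is a pluggable Tokenizer or TokenFilter, which is exactly why the stock Analyzers below differ only in which steps they perform.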
Lucene ships with a set of different Analyzers to be used in different situations. Here we explain a few of them:
First: Theory
1. WhitespaceAnalyzer
The simplest Analyzer in the package: each token starts after whitespace and ends at the next whitespace, with no stemming or stop-word removal performed:
new WhitespaceAnalyzer(Version.LUCENE_36)
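In plain Java, WhitespaceAnalyzer-style tokenization amounts to splitting on runs of whitespace. The class below is a hypothetical stand-in, not Lucene code; it shows the characteristic weakness that case and punctuation survive in the tokens.

```java
import java.util.Arrays;

// Whitespace-only tokenization: split on runs of whitespace.
// Case and punctuation are kept, which is usually not what you
// want for search ("World!" will not match a query for "world").
public class WhitespaceDemo {
    public static String[] tokens(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // prints [Hello,, Lucene, World!]
        System.out.println(Arrays.toString(tokens("Hello, Lucene World!")));
    }
}
```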
2. SimpleAnalyzer
This Analyzer tokenizes the text into runs of letters (using the isLetter method of java.lang.Character) and applies a lower-case filter. It will have problems with East Asian languages such as Chinese:
new SimpleAnalyzer(Version.LUCENE_36)
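The letter-run rule can also be sketched in plain Java; again this is an illustrative stand-in for the behavior, not the Lucene implementation. Note how digits and punctuation act purely as separators and are discarded.

```java
import java.util.ArrayList;
import java.util.List;

// SimpleAnalyzer-style behavior: a token is a maximal run of letters
// (Character.isLetter), lower-cased. Digits and punctuation separate
// tokens and are thrown away -- so "2012" vanishes entirely.
public class SimpleDemo {
    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        // prints [xy, z, corp, s, report]
        System.out.println(tokens("XY&Z Corp's 2012 report"));
    }
}
```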
3. StandardAnalyzer
The most popular Analyzer for English texts. Its tokenizer (StandardTokenizer) applies the following rules for creating tokens, which suit most European languages:
Splits words at punctuation characters, removing the punctuation; however, a dot that is not followed by whitespace is considered part of a token.
Splits words at hyphens, unless there is a number in the token, in which case the whole token is interpreted as a product number and is not split.
Recognizes email addresses and internet hostnames as single tokens.
It also includes a set of basic stop words, which can be extended for different cases and contexts.
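The real StandardTokenizer implements these rules with a full JFlex grammar, but one of them, keeping email addresses as single tokens, is easy to imitate with a regex. The sketch below is a rough imitation under that single rule only (it does not handle the hyphen/product-number rule, for instance), and everything in it is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A rough imitation of one StandardTokenizer rule: email-like strings
// stay one token; everything else splits at non-word characters.
public class StandardishDemo {
    // First alternative matches email-like tokens, second plain words.
    static final Pattern TOKEN = Pattern.compile("[\\w.-]+@[\\w.-]+|\\w+");

    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase());
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        // prints [mail, bob@example.com, re, lucene, 4, 0]
        System.out.println(tokens("Mail bob@example.com re: Lucene-4.0"));
    }
}
```

Notice that this sketch wrongly splits "Lucene-4.0" apart; the real StandardTokenizer would keep it whole under the product-number rule above, which is a good reminder of why the production grammar is much larger than one regex.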
4. PorterAnalyzer
This Analyzer is based on Martin Porter's stemming algorithm. It applies all three steps to the text: tokenizing, lower-case filtering, and stop-word filtering, plus stemming. This approach is highly useful, but note that it can create meaningless tokens, such as "tire" for the word "tired". This is not a problem, because Analyzers are used for fetching documents and do not affect the text the user reads afterwards.
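The real Porter algorithm runs several careful suffix-rewriting steps; the toy below is deliberately not Porter, just a naive suffix stripper, but it demonstrates the same point made above: stems need not be dictionary words, and that is fine because they are only used for matching, never shown to users.

```java
// NOT the real Porter algorithm -- a naive suffix stripper that shows
// the idea (and the hazard): stems need not be dictionary words.
public class NaiveStemmer {
    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[]{"ing", "ed", "es", "s"}) {
            // only strip when a reasonably long stem remains
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2)
                return w.substring(0, w.length() - suffix.length());
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("jumped"));  // prints "jump"
        System.out.println(stem("tired"));   // prints "tir" -- a non-word stem
    }
}
```

Porter's actual rules would restore the trailing "e" and produce "tire" for "tired", which is still not a real improvement in meaning, only in consistency of matching.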
5. StandardBgramAnalyzer
Here the term bgram (bigram) refers to a pair of adjacent tokens, so this Analyzer creates a longer index than the other Analyzers. It filters the text to lower case and, in order to keep stop words in the index, places an underscore character "_" between the first and second part of each token pair. The start and end of a string are indexed twice in this approach. These are the tokens for "Bird on a wire":
bird
bird_on
on_a
a_wire
wire
It should be noted that the idea of forming bigrams is based on finite-state machines. Lastly, remember the big advantage of this approach: it can match queries that include stop words more precisely than the other Analyzers.
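The token scheme above can be sketched in a few lines of plain Java. This reproduces exactly the "Bird on a wire" example; the real StandardBgramAnalyzer's rules (e.g. only pairing across stop words) may differ, so treat the class below as an illustrative assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Bigram sketch: lower-case the tokens, emit the first and last token
// on their own (so the start and end are indexed twice), and join every
// adjacent pair with "_".
public class BigramDemo {
    public static List<String> bigrams(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        out.add(words[0]);                        // start indexed on its own
        for (int i = 0; i + 1 < words.length; i++)
            out.add(words[i] + "_" + words[i + 1]);
        out.add(words[words.length - 1]);         // end indexed on its own
        return out;
    }

    public static void main(String[] args) {
        // prints [bird, bird_on, on_a, a_wire, wire]
        System.out.println(bigrams("Bird on a wire"));
    }
}
```

Because a query is analyzed the same way, a phrase query containing "on a" can match the "bird_on" and "on_a" terms directly instead of silently dropping the stop words.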
Application
Since we are only getting started, we will just set up pom.xml and the application dependencies, and we will continue in the next blog post.
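For a minimal setup, the dependency section of pom.xml only needs the Lucene core and common-analyzers artifacts. The 4.0.0 version below is an assumption for illustration; substitute whichever 4.x release you are targeting.

```xml
<dependencies>
  <!-- core indexing and search classes -->
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.0.0</version>
  </dependency>
  <!-- the stock Analyzers discussed above (Whitespace, Simple, Standard, ...) -->
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>4.0.0</version>
  </dependency>
</dependencies>
```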
Have you noticed that the element contains some additional arguments (lines 22-33)? Yes, check out this awesome article. We will use this in the next posts to make running our application much easier.
Now just run the following and we are finished for today! ;)
Nice post. However, the POM file is broken: groupid and artifactid should be groupId and artifactId. Easy to spot if you know Maven, but newbies may trip on it. :-)
Also, it's missing a :-)