The Lucene open source software project was first released in 1999 and later added to the Apache Foundation in 2005. Lucene builds an inverted index of your data for full-text information retrieval - essentially it indexes your data by keyword - and provides libraries for features such as typo tolerance, sorting, ranking, and much more. With its exceptional documentation and large community, Lucene remains a search workhorse.

Lucene's core libraries offer just about everything you need to build a search application. What it lacks can be found in the many public libraries that fill in missing functionality such as crawling. If you're looking for a free, open source library and SDK to build search, Lucene is not a bad choice. Enterprises may like that it's written in Java, too, for tying into legacy projects.

But if you need something more full-featured and modern, there are similar options. In fact, it was the lack of some of these basic search features that allowed two newer projects, Solr and Elasticsearch, to flourish. Both of these Lucene derivatives are essentially full-featured wrappers that hide Lucene's libraries behind more powerful and easier-to-implement APIs. They also allow Lucene to scale to very large datasets as a distributed system (something Lucene lacks on its own).

Both Solr and Elasticsearch are Apache open source projects as well and benefit from a large community of developers. (However, due to recent licensing changes, it is questionable whether Elasticsearch remains open source.) We built Sajari because we felt Lucene couldn't deliver on our key goals, which were based around fully real-time data updates and complex machine learning based retrieval. Other new search competitors have entered the market, too, further pushing Lucene to the side.

Lucene has several key deficiencies that will become more apparent in time. Whenever a record changes in your database, Lucene will store the new value, but it still hangs onto the old value. So if you make a small change to an item, it basically stores a flag to say that it is deleted and creates an entire new record in memory. This slows down queries, as Lucene needs to run the search and then check the differential for changes on the way out. Periodically the memory buffer fills up, the difference is reconciled, and all the files with any changes get merged and rewritten out to disk. This is fine for projects such as log analysis, because logs don't change (it's one of the reasons that Elasticsearch is focused on this use case with its ELK stack), but less so for use cases like e-commerce search, where data changes frequently.

The days of relying on simple inverted-index ranking algorithms are gone. To improve relevancy and results, language embeddings (vectors) and machine learning should be core to every search project. AI search technology provides a faster, automated feedback loop that improves search result ordering to maximize business goals such as purchases or engagement. Older technologies like Lucene that rely entirely on index matching typically need an army of people writing business rules to make up for an endless set of deficiencies. Search with machine learning outperforms basic keyword search.
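The update-and-merge behavior described above can be sketched in a few lines. This is a toy model, not Lucene's actual API or Java internals: a hypothetical `TinySegmentStore` that, like Lucene, never modifies a stored document in place. An update flags the old record as deleted (a tombstone) and appends a whole new record to an in-memory buffer; queries must filter tombstones on the way out; a periodic merge rewrites storage without the dead entries.

```python
# Toy sketch (assumed names, not Lucene's real classes) of the
# "delete flag + new record" update model described above.

class TinySegmentStore:
    def __init__(self):
        self.docs = {}           # doc_id -> fields ("on disk" segment data)
        self.buffer = {}         # in-memory buffer of not-yet-flushed docs
        self.tombstones = set()  # doc_ids flagged as deleted
        self._next_id = 0

    def add(self, fields):
        doc_id = self._next_id
        self._next_id += 1
        self.buffer[doc_id] = fields
        return doc_id

    def update(self, old_id, fields):
        # Even a small change tombstones the old record and writes an
        # entirely new one; the old value hangs around until the next merge.
        self.tombstones.add(old_id)
        return self.add(fields)

    def search(self, key, value):
        # Queries check both flushed docs and the buffer, then filter out
        # tombstoned ids on the way out - the extra bookkeeping that
        # slows reads between merges.
        live = {**self.docs, **self.buffer}
        return [doc_id for doc_id, fields in live.items()
                if fields.get(key) == value and doc_id not in self.tombstones]

    def merge(self):
        # Periodic reconciliation: flush the buffer, drop tombstoned
        # records, and rewrite the merged segment.
        self.docs.update(self.buffer)
        self.buffer.clear()
        self.docs = {doc_id: fields for doc_id, fields in self.docs.items()
                     if doc_id not in self.tombstones}
        self.tombstones.clear()
```

For example, updating a product's price leaves the old record tombstoned in place until `merge()` runs, which is exactly why this design suits append-mostly data like logs better than frequently changing e-commerce catalogs.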