サクサク読めて、アプリ限定の機能も多数!
トップへ戻る
WWDC24
blog.mikemccandless.com
Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if yo
Live suggestions as you type into a search box, sometimes called suggest or autocomplete, is now a standard, essential search feature ever since Google set a high bar after going live just over four years ago. In Lucene we have several different suggest implementations, under the suggest module; today I'm describing the new AnalyzingSuggester (to be committed soon; it should be available in 4.1).
Google's Chrome browser has a useful translate feature, where it detects the language of the page you've visited and if it differs from your local language, it offers to translate it. Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD w
FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output. They also look cool: That FST maps the sorted words mop, moth, pop, star, stop and top to their ordinal number (0, 1, 2, ...). As you traverse the arcs, you sum up the outputs, so stop hits 3 on the s and 1 on the o, so its output ordinal is 4. The outputs can be arbitrary numbers or byte sequences, or combinati
There are many exciting improvements in Lucene's eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only from its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be. FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance fr
If you've ever wondered how Lucene picks segments to merge during indexing, it looks something like this: That video displays segment merges while indexing the entire Wikipedia (English) export (29 GB plain text), played back at ~8X real-time. Each segment is a bar, whose height is the size (in MB) of the segment (log-scale). Segments on the left are largest; as new segments are flushed, they appe
While indexing, Lucene periodically merges multiple segments in the index into a single larger segment. This keeps the number of segments relatively contained (important for search performance), and also reclaims disk space for any deleted docs on those segments. However, it has a well known problem: the merging process evicts pages from the OS's buffer cache. The eviction is ~2X the size of the m
このページを最初にブックマークしてみませんか?
『Changing Bits』の新着エントリーを見る
j次のブックマーク
k前のブックマーク
lあとで読む
eコメント一覧を開く
oページを開く