Author: Andrei Fortuna
Date: 14:10:11 06/11/02
First some background info: I have some ideas on rewriting my search engine some time later this year. I realized that everything I need I can do with PHP and MySQL, and this way I can personalize it for each user, including which posts to ignore, moving posts to different folders, color and display schemes ... basically what I had in my offline browser, but much improved and easier to code thanks to the PHP + MySQL combination. The only thing I need to write in C/C++ is the part that keeps the word lists: a program that holds those lists internally, takes a phrase/search expression as input, and returns a list of IDs of the articles that match it.

So far so good, but my desire is to make the search as complex as possible. Until now my word search has looked like +abc* -xyz*, i.e. I couldn't afford to have:
1) phrase searches, as in "best move"
2) expressions like "abc*xx" or "abc?x"
3) expressions like "*abc?xx" (words starting with anything)
4) case insensitive searches (I kept the word list case sensitive, so I thought a case insensitive search might be expensive; the alternative was to keep everything case insensitive, but then case sensitive searches would not have worked as expected)

Now I would like to have all 4 cases covered. 1) would mean storing for each word not only the articles it appears in but also the position(s) at which it appears, so that a phrase matches only where its words occur at consecutive positions. For 4), if I store a checksum for each word, I can quickly identify a list of candidate words and do a case insensitive strcmp on them.

The part I'm having trouble with is points 2) and 3): having * inside the word (with letters following), or especially as the first character of the search word. I confess I searched the net last year for algorithms for this and came up with not very satisfying results ...
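To make points 1) and 4) concrete, here is a minimal sketch of the ideas described above. It is written in Python for brevity, although the real word-list program would be in C/C++. The names build_index, phrase_search and build_case_map are hypothetical, words are split on whitespace only, and a lowercased key stands in for the per-word checksum (an assumption: the checksum would have to ignore case for the trick to work).

```python
from collections import defaultdict

def build_index(articles):
    """Map each word to {article_id: [positions]} (a positional index)."""
    index = defaultdict(lambda: defaultdict(list))
    for article_id, text in articles.items():
        for position, word in enumerate(text.split()):
            index[word][article_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of articles where the words occur consecutively."""
    words = phrase.split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for article_id, positions in index[words[0]].items():
        for start in positions:
            # Word number k of the phrase must sit at position start + k.
            if all(start + k in index[w].get(article_id, ())
                   for k, w in enumerate(words[1:], 1)):
                hits.append(article_id)
                break
    return sorted(hits)

def build_case_map(index):
    """Map a case-folded key to every exact spelling stored in the index."""
    case_map = defaultdict(set)
    for word in index:
        case_map[word.lower()].add(word)
    return case_map

# Hypothetical sample data, for illustration only.
articles = {1: "the best move is Nf3", 2: "best is the move", 3: "Best move wins"}
index = build_index(articles)
case_map = build_case_map(index)
```

With this layout a case sensitive search uses the index directly, while a case insensitive one first looks up the folded key in case_map and then merges the article lists of every spelling found there.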
So I'm asking all the bright minds in here for pointers, algorithms, ideas, buzz-words that I should use for an internet search ... I'm 100% certain this is a well known problem, but I have no direction to search further, and I would really like to incorporate a more complex search in my future engine ... please help a fellow programmer in distress :)))

To give more detail: until now I used two programs, a server and a client. The server was always loaded and listening on a port; the client got the request from the web page, contacted the server on that port and made the query. Now I want to eliminate the server altogether. The only reason I had it was that, if I did everything in the client, loading the indexes into memory would take too long. But I plan to keep the index for words starting with 'a' in one file, for words starting with 'b' in another file, etc., so it could be done by a client alone.
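One buzz-word that may fit points 2) and 3) is the "permuterm index": every rotation of word + end marker is stored, so a pattern with a wildcard in the middle or at the front can be turned into a plain prefix lookup. The sketch below is only an illustration under stated assumptions, not a tested design: the names build_permuterm and wildcard_lookup are hypothetical, "$" is assumed never to occur inside a word, and Python's fnmatchcase is used as the final filter (it also treats "[" specially, which is assumed not to appear in search words).

```python
import bisect
from fnmatch import fnmatchcase

END = "$"  # end-of-word marker, assumed never to occur inside a word

def build_permuterm(words):
    """Return a sorted list of (rotation, word) for every rotation of word + END."""
    rotations = []
    for word in words:
        marked = word + END
        for i in range(len(marked)):
            rotations.append((marked[i:] + marked[:i], word))
    rotations.sort()
    return rotations

def wildcard_lookup(rotations, pattern):
    """Return sorted words matching a pattern that may contain * and ?."""
    wild = [i for i, ch in enumerate(pattern) if ch in "*?"]
    if wild:
        # Literal text before the first and after the last wildcard.
        head, tail = pattern[:wild[0]], pattern[wild[-1] + 1:]
        key = tail + END + head
    else:
        key = END + pattern
    # All rotations that start with key form one contiguous sorted run.
    start = bisect.bisect_left(rotations, (key,))
    matches = set()
    for rotation, word in rotations[start:]:
        if not rotation.startswith(key):
            break
        # The prefix lookup only narrows candidates; verify the full pattern.
        if fnmatchcase(word, pattern):
            matches.add(word)
    return sorted(matches)

# Hypothetical word list, for illustration only.
words = ["abcxx", "abcdxx", "abcdx", "xabcqxx", "best", "Best"]
rotations = build_permuterm(words)
```

The price is size: a word of n letters produces n + 1 rotations. Note also that splitting index files by first letter does not help a pattern like "*abc?xx", whereas splitting the rotation list by its first character would still work.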
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.