Understanding computer searches
The simple search in Perl is a
very basic real-time search. It actually opens each file and
checks its contents for matches when the search is submitted.
This makes it a useful tool because its results are always up to
date: even if files were modified moments ago, the search reads
the current pages.
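As a rough illustration of the read-every-file approach described above, here is a minimal sketch. It is written in Python rather than the Perl of the script itself, and the function name `simple_search`, the `extensions` parameter, and the sample file names are assumptions made for the example, not part of the original script.

```python
import os
import tempfile


def simple_search(root, keyword, extensions=(".html", ".txt")):
    """Return paths under `root` whose current contents contain `keyword`."""
    needle = keyword.lower()
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # The file is read at query time, so recent edits are always seen.
            with open(path, encoding="utf-8", errors="replace") as handle:
                if needle in handle.read().lower():
                    matches.append(path)
    return matches


if __name__ == "__main__":
    # Hypothetical one-page site created in a temporary directory.
    with tempfile.TemporaryDirectory() as site:
        page = os.path.join(site, "index.html")
        with open(page, "w", encoding="utf-8") as handle:
            handle.write("<p>Welcome to the widget shop</p>")
        print(simple_search(site, "widget"))  # the page matches
        with open(page, "w", encoding="utf-8") as handle:
            handle.write("<p>Closed for the season</p>")
        print(simple_search(site, "widget"))  # no match after the edit
```

Because nothing is cached, the cost of each search grows with the total size of the files, which is the trade-off the rest of this page discusses.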
Search engines work differently.
Since it is impractical to read billions of files each time a
search is submitted, the results are prepared in advance and
stored in an index. Building an index of precomputed results
often takes considerable resources.
In cases where there are vast
amounts of data to search, it is not uncommon to run servers in
the background that update the indexes as files are modified.
Searching files is one of the
most basic functions, yet one of the most depended-on features,
of any website or database.
In database searches, the data
can be organized into categories, so each search covers a smaller
number of records that are easier to access than large pages.
The database design is usually
closely tied to the search process. By segmenting file data into
smaller chunks, such as a product name, you could search 100,000
names in one file. The name may be a key to the actual data file:
if a name match is found, the search script can open the data
file with that name to find more specific data.
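The two-step lookup described above might be sketched as follows, in Python rather than Perl. The layout is an assumption made for illustration: one index file with a name per line, and one data file per record called `<name>.txt`. The function names `find_names` and `load_record` are hypothetical.

```python
import os
import tempfile


def find_names(index_path, query):
    """Scan the one-name-per-line index file for names containing `query`."""
    needle = query.lower()
    with open(index_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if needle in line.lower()]


def load_record(data_dir, name):
    """Open the data file keyed by a matched name."""
    # Assumption: each record lives in "<name>.txt" inside data_dir.
    path = os.path.join(data_dir, name + ".txt")
    with open(path, encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        index_path = os.path.join(data_dir, "names.txt")
        with open(index_path, "w", encoding="utf-8") as handle:
            handle.write("blue widget\nred gadget\n")
        with open(os.path.join(data_dir, "blue widget.txt"), "w",
                  encoding="utf-8") as handle:
            handle.write("price: 4.99")
        for name in find_names(index_path, "widget"):
            print(name, "->", load_record(data_dir, name))
```

Only the small names file is scanned in full; the larger data files are opened one at a time, and only for names that matched.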
The larger the number of files,
the more complex the script becomes. One million data records
could be broken into 10 million parts to make searches faster.
For example, it is not
necessary to search every line of every file for a phone number.
If you have a file named for the actual phone number, you can
open it directly, or at least use it to reference the full data
file.
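The phone-number case above could look something like this sketch, again in Python rather than Perl. The naming scheme (digits of the number plus `.txt`) and the function names `record_path` and `lookup_phone` are assumptions for the example.

```python
import os
import re
import tempfile


def record_path(data_dir, phone):
    """Map a phone number to the file named after its digits."""
    # Strip punctuation so "555-0100" and "5550100" point at the same file.
    digits = re.sub(r"\D", "", phone)
    return os.path.join(data_dir, digits + ".txt")


def lookup_phone(data_dir, phone):
    """Return the record for `phone`, or None if no such file exists."""
    try:
        with open(record_path(data_dir, phone), encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        with open(os.path.join(data_dir, "5550100.txt"), "w",
                  encoding="utf-8") as handle:
            handle.write("Pat Example")  # made-up record contents
        print(lookup_phone(data_dir, "555-0100"))
        print(lookup_phone(data_dir, "555-0199"))
```

No file is scanned at all here: the file system itself acts as the index, so the lookup is a single open call whether there are ten records or ten million.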
Since most large searches are
custom written, there is no way of telling from the outside how
the programmer made a given one work. It does get quite
complicated, though, and much more extensive than the simple
search we have provided here.
You should understand the
limitations of the simple search and not try to run it on a
million pages. If you get to that point, you will need something
much more advanced since you can't open and process a million
files very quickly.
In most site searches, the
pages are converted to data files and an inverted index is
created that lists the relevant pages for each keyword. The index
can take a long time to compile, and whenever pages change it
must be updated, which makes real-time searching more difficult.
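To make the inverted index idea concrete, here is a small in-memory sketch in Python (not the Perl of the simple search). The function names, the word-splitting rule, and the sample pages are all assumptions for illustration; a production index would be stored on disk and handle far more cases.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into a set of lowercase keywords."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(pages):
    """Map each keyword to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the pages that contain every keyword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


def update_page(index, name, old_text, new_text):
    """Re-index one changed page instead of rebuilding the whole index."""
    old_words, new_words = tokenize(old_text), tokenize(new_text)
    for word in old_words - new_words:
        index[word].discard(name)
    for word in new_words - old_words:
        index[word].add(name)


if __name__ == "__main__":
    pages = {
        "a.html": "Blue widgets for sale",
        "b.html": "Red widgets and gadgets",
    }
    index = build_index(pages)
    print(sorted(search(index, "red widgets")))
    update_page(index, "a.html", pages["a.html"], "Blue gadgets for sale")
    print(sorted(search(index, "gadgets")))
```

A query now touches only the index entries for its keywords instead of every page, but the index is only as current as its last update, which is the trade-off against the real-time search.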
Search engines like Google
can take weeks to update their indexes, with the data spanning
thousands of servers. Searches can be simple or extremely
complex, but the concept is still the same: matching keywords in
files.