This article shows how NoSQL databases can be used to build high-quality, low-cost search solutions, and explains which search requirements affect the selection of a NoSQL system.
Keywords: NoSQL, search frameworks, method, search tools.
Nowadays there are many well-known search engines, such as Google and Bing, where we enter a search query and quickly get an appropriate list of results. However, few search tools are tailored to enterprise-level companies, and there is a lack of search-enabled database applications that solve domain-specific problems. NoSQL databases provide the capacity to integrate high-quality search into a database application.
Let us start by defining the basic search terms and presenting some concepts used by search frameworks.
Search is one of the most important productivity tools for knowledge workers. For the purposes of this article, we define search as finding an item of interest in a NoSQL database when at least some information about that item is known. For example, you can search for keywords in a document, or search by the document's name, author, or creation date.
Search technologies are applied to the same structured records used in an RDBMS, as well as to semi-structured and “unstructured” text documents containing paragraphs and pictures.
Finding the document you need quickly saves working time. Companies such as Google and Yahoo!, in using NoSQL systems, faced problems related to search.
In this paper, we first define some terms used in building search applications and then show how NoSQL systems can be used to create search solutions.
Whenever an application is built, the question of providing search arises, because search is important to users. The types of search are the following:
– Boolean search used in RDBMSs
– Full-text keyword used in search frameworks
– Structured search used in NoSQL
Boolean search is performed by executing a query that looks for specific rows in a relational database. Search frameworks such as Apache Lucene or Apache Solr are designed to find documents that contain given full-text keywords. NoSQL systems offer a further type, called structured search, which combines Boolean and full-text keyword search [1]. A detailed comparison is shown in Table 1.
Table 1
A comparison of Boolean, full-text keyword, and structured search
Search type | Representation of used structures | Combines conditional operations and full-text | Applied to
Boolean search: lets users combine keywords with AND, NOT, and OR to produce more relevant results | Tables of rows and columns, where each row corresponds to a single record | No | Strictly structured data
Full-text keyword search: a set of techniques for searching a single computer-stored document or a collection in a full-text database | Documents, keywords | No | Unstructured text files
Structured search: a combination of Boolean and full-text keyword search | XML or JSON | Yes | Semi-structured documents
The disadvantage of Boolean search systems is that they return only exact matches. Because they work only on highly structured data, they either find the corresponding record or they do not. To look up a record, you have to know exactly which values you expect to match and allow for possible errors yourself. That is, no ‘fuzzy match’ is available.
With full-text keyword search, it is hard to narrow the search by the properties of a document. For instance, many search frameworks do not let you limit a search to documents created by a specific author or within a specific period of time, or sort the results by the time of document creation. Structured search offers the best of both approaches in a single search query.
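The idea of a structured search can be illustrated with a minimal Python sketch. It is not the API of any particular NoSQL product: the documents, their field names (`author`, `created`, `text`), and the function `structured_search` are assumptions made only for this example. A single query combines exact conditions on document properties with a keyword match on the free text and sorts the result by creation date.

```python
from datetime import date

# Hypothetical semi-structured documents (JSON-like); all names and values are illustrative.
DOCUMENTS = [
    {"id": 1, "author": "Ivanov", "created": date(2019, 3, 1),
     "text": "NoSQL databases integrate full-text search"},
    {"id": 2, "author": "Petrov", "created": date(2019, 5, 20),
     "text": "Relational databases use Boolean search over rows"},
    {"id": 3, "author": "Ivanov", "created": date(2018, 11, 7),
     "text": "Search frameworks index keywords in documents"},
]


def structured_search(docs, keyword, author=None, created_from=None, created_to=None):
    """Return documents that contain the keyword and satisfy every given
    field condition, sorted from newest to oldest."""
    keyword = keyword.lower()
    hits = []
    for doc in docs:
        # Full-text part: whole-word match on whitespace-separated words
        # (a simplification; punctuation is not handled).
        if keyword not in doc["text"].lower().split():
            continue
        # Boolean part: exact conditions on structured fields.
        if author is not None and doc["author"] != author:
            continue
        if created_from is not None and doc["created"] < created_from:
            continue
        if created_to is not None and doc["created"] > created_to:
            continue
        hits.append(doc)
    return sorted(hits, key=lambda d: d["created"], reverse=True)


results = structured_search(DOCUMENTS, "search", author="Ivanov")
print([d["id"] for d in results])  # prints [1, 3]
```

A real system would evaluate both parts against indexes rather than scanning every document; the scan is used here only to keep the sketch short.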
When selecting a NoSQL system, it is essential to consider its findability, that is, the database features that help users find the records they are looking for. Here are a few search types [2] used in NoSQL that a system can include:
– Full-text search is the process of finding records that contain natural-language text such as English. It is suitable when your data contains free-form text, as in an article or a book. Full-text search techniques include removing insignificant words such as “and, or, the” and stemming (removing suffixes from words).
– Semi-structured search is applicable when the data has both a rigid structure and full-text sections. For example, an invoice for hours worked on a project might have long sentences describing the tasks that were performed on the project. A sales order might contain a full-text description of the products in the order. A business requirements document might have structured fields for who requested a feature and what release it will be in, plus a full-text description of what the feature will do.
– Geographic search is the process of adjusting the ranking of search results based on geographic distance. For instance, the user looks for a restaurant within a five-minute drive.
– Network search is the process of adjusting the ranking of search results based on data found in graphs such as social networks. For example, the user looks for a restaurant that a friend in a social network has rated higher than 4.
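As an illustration of geographic search from the list above, the following Python sketch ranks places by great-circle distance using the haversine formula. The coordinates, the place names, and the 3 km radius (used here as a rough stand-in for a short drive) are assumptions made for the example, not data from any real system.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Hypothetical restaurants with illustrative coordinates.
RESTAURANTS = [
    {"name": "Far", "lat": 43.5000, "lon": 76.9000},
    {"name": "Mid", "lat": 43.2600, "lon": 76.9000},
    {"name": "Near", "lat": 43.2410, "lon": 76.9010},
]


def nearby(places, lat, lon, max_km):
    """Names of the places within max_km of (lat, lon), nearest first."""
    scored = [(haversine_km(lat, lon, p["lat"], p["lon"]), p["name"]) for p in places]
    return [name for distance, name in sorted(scored) if distance <= max_km]


print(nearby(RESTAURANTS, 43.2400, 76.9000, max_km=3))  # prints ['Near', 'Mid']
```

Straight-line distance is only an approximation of driving time; a production system would typically rely on a geospatial index and routing data.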
Next, we look at the strategies and methods [3] that make NoSQL search engines effective, so that they can take in a search request and quickly return results.
– A range index is a way of indexing all the values of the database items in increasing order. Range indexes are ideal for finding keywords in alphabetical order, timestamps, or amounts that lie between two values. A range index can be created for any data type that has a logical way of sorting items. For images or full-text paragraphs, you should not create a range index.
– A reverse index (also called an inverted index) is an alphabetical list of the words in a document collection, each with the positions where it occurs. The structure is similar to the index at the end of a book, which lists terms in alphabetical order with the numbers of the pages on which each term appears. Using the index, you can quickly jump to any place in the book that uses the term you need. In search software, reverse indexes are used in the same way: for each word in the document collection, there is a list of all documents that contain this word. This speeds up the search for documents containing given keywords. Some search platforms, such as Apache Lucene, are designed to create and maintain reverse indexes for large document collections.
– Search ranking is a way of ordering the results of a user’s request by the probability that they match what the user is looking for. The engine favors documents that have a higher keyword density for the requested word, applies various criteria for determining the quality of the page and of the resource as a whole, and returns the result to the user. Keyword density is how often a word appears in a document, weighted by the size of the document. If only the raw count were used, longer documents with a large number of keyword occurrences would always have a higher rating. Therefore, so that longer documents do not always appear first in the search results, both the number of keyword occurrences and the total number of words in the document should be taken into account. Ranking algorithms may also take into account other factors, such as the type of document, recommendations from your social networks, and relevance to a particular task.
– Stemming is finding the stem of a word, that is, the part that conveys its lexical meaning, so that a search finds not only the exact words from the query but also their various forms that may interest the user. For example, if a person enters the word “cat”, the search engine also shows the pages that contain the words “cats, catlike, and catty”.
– Synonym expansion is a search method that includes synonyms of specific keywords in the search results. For example, when the user searches for the keyword aspirin, documents that use the chemical names salicylic acid and acetylsalicylic acid can be added to the results. A good example of a resource used for synonym expansion is the WordNet database.
– Entity extraction is the process of finding and tagging named entities in your text. Dates, people’s names, organizations, geolocations, and product names are types of entities that an entity extraction program can tag. The most common way to tag text is to use XML wrapper elements. Native XML databases, such as MarkLogic, provide functions for automatically finding and tagging entities in your text.
– Pattern matching (wildcard search) is the process of adding special characters to a search word so that multiple words match your query. For example, the symbol * stands for zero or more unknown characters, so the query dog* matches words such as dog, dogs, or dogged, while the symbol ? stands for exactly one unknown letter, so dog? matches dogs. Many search engines do not support wildcards at the start of a word, since adding this support roughly doubles the size of the stored indexes.
– A proximity search is a way of finding two or more words that are close to each other in a document. For example, you can indicate that you are interested in all documents in which the words dog and love are within 20 words of each other. Documents in which these words are nearer to one another get a higher ranking in the returned results.
– Key-word-in-context (KWIC) libraries are tools that help you add keyword highlighting to each search result. This is generally done by adding a wrapper element around the keywords in the resulting document fragments on the search results page.
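Several of the methods listed above can be sketched together in a few lines of Python. The example below builds a small reverse index with word positions, ranks matches by keyword density (occurrences divided by document length), and uses the stored positions for a simple proximity check. The sample documents, the stop-word list, and the function names are assumptions made for illustration; this is not how Apache Lucene or any specific NoSQL engine is implemented.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; real engines use much larger ones.
STOP_WORDS = {"and", "or", "the", "a", "of"}

# Hypothetical document collection: document id -> text.
DOCS = {
    "short": "The dog chased the cat",
    "long": "The dog and the other dog walked slowly along the quiet river bank near town",
    "none": "Cats sleep most of the day",
}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_reverse_index(docs):
    """Map each word to {document id: [positions]} and record document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        words = tokenize(text)
        lengths[doc_id] = len(words)
        for position, word in enumerate(words):
            index[word].setdefault(doc_id, []).append(position)
    return index, lengths


def search(index, lengths, keyword):
    """Document ids containing the keyword, highest keyword density first."""
    postings = index.get(keyword.lower(), {})
    scored = [(len(positions) / lengths[doc_id], doc_id)
              for doc_id, positions in postings.items()]
    return [doc_id for density, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


def within(index, first, second, max_distance):
    """Document ids where the two words occur at most max_distance positions apart.
    Positions are counted after stop-word removal."""
    hits = []
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = index.get(second, {}).get(doc_id, [])
        if any(abs(p - q) <= max_distance
               for p in first_positions for q in second_positions):
            hits.append(doc_id)
    return sorted(hits)


index, lengths = build_reverse_index(DOCS)
print(search(index, lengths, "dog"))    # prints ['short', 'long']
print(within(index, "dog", "cat", 2))   # prints ['short']
```

In this sample the longer document contains the word dog twice, yet the shorter one is ranked first because its density is 1/3 against 2/11, which is the effect of taking document length into account.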
This article discussed the search types and strategies that can be used in a NoSQL database and the methods that make searching fast in NoSQL systems. It also covered how document structure affects search quality.
In general, NoSQL databases are recommended for the data of enterprise-level companies, which is usually unstructured and can be huge.
References:
1. Schuler, K., Peterson, C. P., Vincze, E. Creating and Managing an Enterprise wide Program, Chapter 8: Data Identification and Search Techniques. Syngress, 2009.
2. Search engine NoSQL databases. https://bi-insider.com/posts/search-engine-nosql-database/, 2019.
3. Fowler, A. NoSQL For Dummies. Wiley, 2015.