
Comparison of full text search engines: Lucene, Sphinx, PostgreSQL, MySQL?

Date: 2022-09-26 12:20:32


Translated from: Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?

I'm building a Django site and I am looking for a search engine.

A few candidates:

Lucene/Lucene with Compass/Solr

Sphinx

PostgreSQL built-in full text search

MySQL built-in full text search

Selection criteria:

result relevance and ranking
searching and indexing speed
ease of use and ease of integration with Django
resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
scalability
extra features such as "did you mean?", related searches, etc

Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.

EDIT: As for indexing needs, as users keep entering data into the site, that data would need to be indexed continuously. It doesn't have to be real time, but ideally new data would show up in the index with no more than 15 - 30 minutes of delay.


#2

I'm looking at PostgreSQL full-text search right now, and it has all the right features of a modern search engine: really good extended character and multilingual support, and nice tight integration with text fields in the database.

But it doesn't have user-friendly search operators like + or AND (it uses & | !), and I'm not thrilled with how it works on their documentation site. While it bolds match terms in the result snippets, the default algorithm for selecting which terms to bold is not great. Also, if you want to index RTF, PDF or MS Office files, you have to find and integrate a file format converter.
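The operator gap described above can be bridged in application code. What follows is only a minimal sketch, assuming you are willing to translate a small user-facing syntax (`AND`, `OR`, `NOT`, a leading `+` or `-`) into the `& | !` form before handing it to PostgreSQL. The helper name `to_tsquery_text` and the chosen syntax are hypothetical, not part of PostgreSQL or Django.

```python
import re

# Hypothetical mapping from user-friendly keywords to tsquery operators.
_BINARY_OPERATORS = {"AND": "&", "OR": "|"}

# A sign only counts as an operator when it starts a whitespace-separated word,
# so "full-text" is read as two plain terms rather than "full" and "NOT text".
_TOKEN = re.compile(r"(?<!\S)[+-]?\w+|\w+")


def to_tsquery_text(user_query: str) -> str:
    """Translate e.g. 'django AND search NOT mysql' to 'django & search & !mysql'."""
    parts = []          # pieces of the tsquery string built so far
    pending_op = None   # binary operator waiting for its right-hand term
    negate = False      # whether the next term should be prefixed with '!'
    for token in _TOKEN.findall(user_query):
        upper = token.upper()
        if upper in _BINARY_OPERATORS:
            pending_op = _BINARY_OPERATORS[upper]
            continue
        if upper == "NOT":
            negate = True
            continue
        if token.startswith("-"):
            negate, token = True, token[1:]
        elif token.startswith("+"):
            token = token[1:]
        term = ("!" if negate else "") + token.lower()
        if parts:
            # Bare terms with no explicit operator are joined with AND.
            parts.append(pending_op or "&")
        parts.append(term)
        pending_op, negate = None, False
    return " ".join(parts)


print(to_tsquery_text("django AND search NOT mysql"))
print(to_tsquery_text("+lucene solr OR sphinx -mysql"))
```

The resulting string is meant to be passed as a bound parameter to `to_tsquery`, never concatenated into SQL. PostgreSQL 11 and later also ship `websearch_to_tsquery`, which accepts a comparable syntax natively, so a helper like this is mainly of interest on older versions.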

OTOH, it's way better than the MySQL text search, which doesn't even index words of three letters or fewer. It's the default for the MediaWiki search, and I really think it's no good for end-users: /analysis/mediawiki-search/

In all cases I've seen, Lucene/Solr and Sphinx are really great. They're solid code and have evolved with significant improvements in usability, so the tools are all there to build search that satisfies almost everyone.

For SHAILI: Solr includes the Lucene search code library and has the components to be a nice stand-alone search engine.

#3

SearchTools-Avi said "MySQL text search, which doesn't even index words of three letters or fewer."

FYI, the MySQL fulltext minimum word length has been adjustable since at least MySQL 5.0. Google 'mysql fulltext min length' for simple instructions.

That said, MySQL fulltext has limitations: for one, it gets slow to update once you reach a million records or so, ...

#4

I would add mnoGoSearch to the list. It is an extremely performant and flexible solution that works like Google: the indexer fetches data from multiple sites, and you can use basic criteria or write your own hooks to get the best search quality. It can also fetch the data directly from the database.

The solution is not well known today, but it fits most needs. You can compile and install it either on a standalone server or even on your main server. It doesn't need as many resources as Solr, as it's written in C and runs perfectly even on small servers.

In the beginning you need to compile it yourself, so it requires some knowledge. I made a tiny script for Debian, which could help. Any adjustments are welcome.

Since you are using the Django framework, you could either use a PHP client in the middle or find a solution in Python; I have seen some articles on this.

And, of course, mnoGoSearch is open source, under the GNU GPL.

#5

Just my two cents on this very old question. I would highly recommend taking a look at ElasticSearch.

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

The advantages over other FTS (full text search) engines are:

RESTful interface
Better scalability
Large community
Built by Lucene developers
Extensive documentation
There are many open source libraries available (including for Django)

We are using this search engine in our project and are very happy with it.

#6

Apache Solr

Apart from answering the OP's queries, let me offer some insights on Apache Solr, from a simple introduction to detailed installation and implementation.

Simple Introduction

Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.

Solr shouldn't be used to solve real-time problems. As a search engine, though, Solr is very capable and works flawlessly.

Solr works fine on high-traffic web applications (I read somewhere that it is not suited for this, but I stand by my statement). It mainly consumes RAM rather than CPU.

result relevance and ranking

Boosting helps you control which results show up on top. Say you're searching for the name john in the fields firstname and lastname, and you want matches in the firstname field to count for more; then you need to boost the firstname field like this:

http://localhost:8983/solr/collection1/select?q=firstname:john^2%20OR%20lastname:john

As you can see, the firstname field is boosted by a factor of 2.
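Because the whole boosted expression has to travel inside the single `q` parameter, it needs URL encoding when built from code. The following is a rough sketch of one way to assemble such a URL from Python, assuming the same local `collection1` core as above. The helper name `boosted_query_url` is made up for illustration, and the search term is not escaped for Solr's special characters.

```python
from urllib.parse import urlencode

# Assumed endpoint: the local example core used throughout this answer.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def boosted_query_url(term, boosts):
    """Build a select URL that searches `term` in several fields.

    `boosts` maps a field name to its boost factor; a factor of 1 means
    "no boost" and is left out of the query string.
    """
    clauses = []
    for field, boost in boosts.items():
        clause = f"{field}:{term}"
        if boost != 1:
            clause += f"^{boost}"
        clauses.append(clause)
    # All clauses go into one q parameter, joined with OR.
    params = {"q": " OR ".join(clauses), "wt": "json"}
    return f"{SOLR_SELECT}?{urlencode(params)}"


url = boosted_query_url("john", {"firstname": 2, "lastname": 1})
print(url)
```

Keeping the boosts in a dictionary makes it easy to tune relevance without touching the code that builds the request.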

More on Solr Relevancy

searching and indexing speed

The speed is unbelievably fast, with no compromise. That is the reason I moved to Solr.

Regarding the indexing speed, Solr can also handle JOINs from your database tables. A larger, more complex JOIN does affect the indexing speed. However, a generous RAM configuration can easily handle this situation.

The more RAM you have, the faster Solr indexes.

ease of use and ease of integration with Django

I have never attempted to integrate Solr and Django; however, you can do that with Haystack. I found an interesting article on the subject, and here's the GitHub repository for it.

resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU

Solr thrives on RAM, so if you have plenty of RAM, you don't have to worry about Solr.

Solr's RAM usage shoots up during a full index if you have a billion records or so; you can make smart use of delta imports to handle this situation. As explained, Solr is only a near real-time solution.

scalability

Solr is highly scalable. Have a look at SolrCloud. Some of its key features:

Shards (sharding is the concept of distributing the index among multiple machines, say if your index has grown too large)
Load Balancing (if SolrJ is used with SolrCloud, it automatically takes care of load balancing using its round-robin mechanism)
Distributed Search
High Availability
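To make the round-robin idea in the list above concrete, here is a toy sketch of a client that hands out nodes in rotation. It is not how SolrJ is implemented, only an illustration of the technique; the class name and the three node URLs are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out node URLs one after another, wrapping around at the end."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = cycle(nodes)

    def next_node(self):
        # Each call returns the next node in the rotation.
        return next(self._nodes)


# Hypothetical node addresses, used only for this illustration.
NODES = [
    "http://solr1:8983/solr",
    "http://solr2:8983/solr",
    "http://solr3:8983/solr",
]

balancer = RoundRobinBalancer(NODES)
picked = [balancer.next_node() for _ in range(4)]
print(picked)
```

A real client would also need to skip nodes that are down, which this sketch leaves out.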

extra features such as "did you mean?", related searches, etc

For the above scenario, you could use the SpellCheckComponent that is bundled with Solr. There are a lot of other features: the SnowballPorterFilterFactory, for example, helps retrieve records so that if you typed books instead of book, you will still be presented with results related to book.
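For a feel of what a "did you mean?" feature does, the snippet below is a small stand-in built on Python's standard `difflib` module. It is not the SpellCheckComponent and uses a different matching method; the function name `did_you_mean` and the sample vocabulary are assumptions made for the example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; in practice it would come from the indexed terms.
VOCABULARY = ["Jean", "Jack", "Jason", "Vego", "Grunt",
              "Jasper", "Fred", "Jenna", "Rebecca", "Roland"]


def did_you_mean(query, vocabulary, limit=3):
    """Return up to `limit` vocabulary entries that look similar to `query`."""
    # Compare in lower case, but return entries with their original spelling.
    by_lowercase = {word.lower(): word for word in vocabulary}
    matches = get_close_matches(query.lower(), list(by_lowercase),
                                n=limit, cutoff=0.6)
    return [by_lowercase[match] for match in matches]


suggestions = did_you_mean("Jaosn", VOCABULARY)
print(suggestions)
```

The best match comes first in the returned list, so showing `suggestions[0]` is enough for a single "did you mean" hint.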

This answer broadly focuses on Apache Solr & MySQL. Django is out of scope.

Assuming that you are in a Linux environment, you can proceed further with this article. (Mine was Ubuntu 14.04.)

Detailed Installation

Getting Started

Download Apache Solr from here. The version used here is 4.8.1. You could download a newer version, but I found this one stable.

After downloading the archive, extract it to a folder of your choice, say Downloads or whatever, so that it looks like Downloads/solr-4.8.1/

At your prompt, navigate inside the directory:

shankar@shankar-lenovo: cd Downloads/solr-4.8.1

So now you are here:

shankar@shankar-lenovo: ~/Downloads/solr-4.8.1$

Start the Jetty Application Server

Jetty is available inside the example folder of the solr-4.8.1 directory, so navigate inside that and start the Jetty Application Server.

shankar@shankar-lenovo:~/Downloads/solr-4.8.1/example$ java -jar start.jar

Now, do not close the terminal; minimize it and set it aside.

(TIP: Use & after start.jar to make the Jetty server run in the background.)

To check if Apache Solr runs successfully, visit this URL in the browser: http://localhost:8983/solr

Running Jetty on a custom port

It runs on port 8983 by default. You could change the port either on the command line or directly inside the jetty.xml file.

java -Djetty.port=9091 -jar start.jar

Download the JConnector (MySQL Connector/J)

This JAR file acts as a bridge between MySQL and JDBC. Download the Platform Independent version here.

After downloading it, extract the folder, copy mysql-connector-java-5.1.31-bin.jar and paste it into the lib directory:

shankar@shankar-lenovo:~/Downloads/solr-4.8.1/contrib/dataimporthandler/lib

Creating the MySQL table to be linked to Apache Solr

To put Solr to use, you need some tables and data to search. For that, we will use MySQL to create a table and insert some random names, and then use Solr to connect to MySQL and index that table and its entries.

1. Table structure

CREATE TABLE test_solr_mysql (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(45) NULL,
    created TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);

2. Populate the above table

INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jean');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jack');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jason');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Vego');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Grunt');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jasper');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Fred');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jenna');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Rebecca');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Roland');

Getting inside the core and adding the lib directives

1. Navigate to

shankar@shankar-lenovo: ~/Downloads/solr-4.8.1/example/solr/collection1/conf

2. Modify the solrconfig.xml file

Add these two directives to this file:

<lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />

Now add the DIH (Data Import Handler):

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

3. Create the db-data-config.xml file

If the file already exists, skip creating it and simply add these lines to it. As you can see in the first line, you need to provide the credentials of your MySQL database: the database name, username and password.

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/yourdbname" user="dbuser" password="dbpass"/>
  <document>
    <entity name="test_solr" query="select CONCAT('test_solr-',id) as rid,name from test_solr_mysql WHERE '${dataimporter.request.clean}' != 'false' OR `created` > '${dataimporter.last_index_time}'">
      <field name="id" column="rid" />
      <field name="solr_name" column="name" />
    </entity>
  </document>
</dataConfig>

(TIP: You can have any number of entities, but watch out for the id field; if two ids are the same, indexing will be skipped.)
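The reason the query above builds `rid` with `CONCAT('test_solr-', id)` can be pictured with a toy model. The sketch below assumes the index behaves like a map keyed by the unique `id`, where a later document with the same key takes the place of the earlier one; that is a simplification for illustration, and `other_entity` is a hypothetical second entity that is not part of the configuration above.

```python
def build_index(rows_by_entity, prefix_ids):
    """Simulate indexing rows from several entities into one id-keyed map."""
    index = {}
    for entity, rows in rows_by_entity.items():
        for row in rows:
            # With prefix_ids=True the key mirrors CONCAT('<entity>-', id).
            key = f"{entity}-{row['id']}" if prefix_ids else str(row["id"])
            index[key] = row["name"]  # same key: later row replaces earlier one
    return index


rows = {
    "test_solr": [{"id": 1, "name": "Jean"}, {"id": 2, "name": "Jack"}],
    # Hypothetical second entity whose numeric ids overlap with the first.
    "other_entity": [{"id": 1, "name": "Acme"}],
}

without_prefix = build_index(rows, prefix_ids=False)
with_prefix = build_index(rows, prefix_ids=True)
print(len(without_prefix), len(with_prefix))
```

In this model the unprefixed run ends up with two entries because the rows with id 1 share a key, while the prefixed run keeps all three.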

4. Modify the schema.xml file

Add this to your schema.xml:

<uniqueKey>id</uniqueKey>
<field name="solr_name" type="string" indexed="true" stored="true" />

Implementation

Indexing

This is where the real work is. You need to index the data from MySQL into Solr in order to make use of Solr queries.

Step 1: Go to Solr Admin Panel

Open the URL http://localhost:8983/solr in your browser. The Solr admin panel opens.

Go to Logging in order to check whether any of the above configuration has led to errors.

Step 2: Check your Logs

OK, so now you are here. As you can see, there are a lot of yellow messages (WARNINGS). Make sure you don't have error messages marked in red. Earlier, in our configuration, we added a select query to our db-data-config.xml; if there were any errors in that query, they would show up here.

Fine, no errors. We are good to go. Let's choose collection1 from the list and select Dataimport.

Step 3: DIH (Data Import Handler)

Using the DIH, you will connect to MySQL from Solr through the configuration file db-data-config.xml, from the Solr interface, and retrieve the 10 records from the database, which then get indexed into Solr.

To do that, choose full-import and check the options Clean and Commit. Now click Execute.

Alternatively, you could use a direct full-import query like this too:

http://localhost:8983/solr/collection1/dataimport?command=full-import&commit=true
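The same request can be assembled from a script, which is handy if imports are to run on a schedule, such as the 15 to 30 minute window mentioned in the question. The sketch below only builds the URLs and assumes the local `collection1` core from this walkthrough. The function name is invented, and the `clean=false` variant relies on the `${dataimporter.request.clean}` condition in the db-data-config.xml query, which then selects only rows created after the last index time.

```python
from urllib.parse import urlencode

# Assumed endpoint: the /dataimport handler registered in solrconfig.xml.
DIH_ENDPOINT = "http://localhost:8983/solr/collection1/dataimport"


def dataimport_url(clean, commit=True):
    """Build a DIH full-import URL.

    clean=True  -> wipe the index and import every row.
    clean=False -> keep the index; with the query shown in
                   db-data-config.xml only newer rows are selected.
    """
    params = {
        "command": "full-import",
        "clean": "true" if clean else "false",
        "commit": "true" if commit else "false",
    }
    return f"{DIH_ENDPOINT}?{urlencode(params)}"


full_url = dataimport_url(clean=True)
incremental_url = dataimport_url(clean=False)
print(full_url)
print(incremental_url)
```

With Solr running, either URL could be requested with `urllib.request.urlopen` from a cron job or a scheduled task.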

After you click Execute, Solr begins to index the records. If there are any errors, it will say Indexing Failed and you have to go back to the Logging section to see what has gone wrong.

Assuming there are no errors with this configuration and the indexing completes successfully, you will get a success notification.

Step 4: Running Solr Queries

It seems everything went well, so now you can use Solr queries to query the data that was indexed. Click Query on the left and then press the Execute button at the bottom.

You will see the indexed records.

The corresponding Solr query for listing all the records is

http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true

Well, there go all 10 indexed records. Say we need only names starting with Ja; in this case, you need to target the column name solr_name, hence your query goes like this:

http://localhost:8983/solr/collection1/select?q=solr_name:Ja*&wt=json&indent=true
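The two queries above can also be generated from Python. This sketch assumes the same local core and builds the URLs with the standard library. The helper names are hypothetical, and `local_prefix_match` merely imitates the prefix query on the ten sample names, assuming the match is case-sensitive, as one would expect for the `string` field type declared in schema.xml.

```python
from urllib.parse import urlencode

# Assumed endpoint: the select handler of the example core.
SELECT_ENDPOINT = "http://localhost:8983/solr/collection1/select"

# The ten names inserted into test_solr_mysql earlier in this answer.
NAMES = ["Jean", "Jack", "Jason", "Vego", "Grunt",
         "Jasper", "Fred", "Jenna", "Rebecca", "Roland"]


def select_url(query):
    """Build a select URL that asks for indented JSON output."""
    params = {"q": query, "wt": "json", "indent": "true"}
    return f"{SELECT_ENDPOINT}?{urlencode(params)}"


def local_prefix_match(prefix, names):
    """Imitate a `field:prefix*` query on a plain list, for illustration only."""
    return [name for name in names if name.startswith(prefix)]


all_url = select_url("*:*")
prefix_url = select_url("solr_name:Ja*")
print(all_url)
print(prefix_url)
print(local_prefix_match("Ja", NAMES))
```

`urlencode` percent-encodes the `:` and `*` characters, which makes the generated URLs safe to pass to any HTTP client.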

That's how you write Solr queries. To read more about it, check this beautiful article.
