Created on 01 Aug 2018 ;    Modified on 19 Sep 2018 ;    Translationitalian

Full text search with Python - N.1

Beginnig to investigate how to implement full text search in a WEB based application developed with Python/Django, I read that the modern versions of SQLite 3, the DBMS library [1] distributed with Python, support it.

They do this using a FTS support library [2], to be used in versions 3 and later (FTS3, FTS4 and FTS5, at the time of writing; versions 1 and 2 must be avoided because bugged).

Also, always at the time of writing, at least FTS3 is supported by SQLite 3 distributed with Python 3.6 on MS Windows 10 system (64 bit) [3].

I often work with SQLite on Python using Django. So I decided to try this feature to check how to use it. And I observed that it is definitely simple.

For example, without being scared of the length, let's take the program sqlite_fts.py of which I quote the listing at the end of the article: Listing of sqlite_fts.py.

The create_table (db, tname) function (lines 102-117) creates two tables with fixed structure (for simplicity of coding):

  • the table {tname} whose record has the fields id/name/profile; where we use name as key; while we will do the research on profile;
  • a FTS3 virtual table named v_{tname} [4] whose record has fields Name/profile; the first we will use as a link to the record in the table {tname}, the second we are going to feed with the text that will be the object of the full text search.

The table v_{tname} is special: it feeds the appropriate managed indexes from SQLite 3/FTS3 to perform free text searches. Its primary purpose is not to record information, but only to feed the indexes. If necessary you can also use it to record information. But in our test program we do it in {tname} table.

The insert (db, tname, persons) function (lines 64-75) inserts some people in the DB. In particular, it performs two insertions:

  • at line 70 it inserts in the table {tname}, the one made to record data;
  • to line 71 it inserts the same data in the virtual table v_ {tname}, the one that must feed the indexes for free text search.

The fts (db, tname, what) function (lines 90-99) does the full text search, made with a SELECT on the virtual table v_ {tname} using the MATCH operator. Line 95 asks to match the string what in the profile field. The string what can be a word, as a sentence.

The rest of the code is routine. In particular, the main () executes the opening of the DB [5] in line 193; and in lines 200-212 executes an infinite loop to interactively execute operator commands.

The execution of the commands is done due the cmd_exe (db, cmd) function (lines 142-172).

The other functions are support: drop tables, list of records, and so on.

The data entered in the tables for testing, are embedded in the variable PERSONS (source lines 41-60).

An example of a test session

Let's take an example of a trial session.

Suppose we have created the sqlite_fts.py program in one temporary test directory: test. We perform the following steps:

>cd test                               # go to test directory
>mkdir data                            # create data dir for DB
>python sqlite_fts.py                  # execute program
data/db.sqlite3 bd opened
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? c             # create tables persons
table name to create? (^C to stop) persons
    creating table(s) persons and v_persons
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? i             # insert 5 records in tables (first 5 USA Presidents)
    inserting persons
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? l             # list persons (output cutted)
    listing persons
    George Washington: George Washington (February 22, 1732[b][c] – December 14, 1799) was the first President ...
    ... CUT ...
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? fts           # query for "virginia": 4 Presidents
full text search of what? (^C to stop) virginia
    searching virginia in v_persons
    George Washington
    Thomas Jefferson
    James Madison Jr.
    James Monroe
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? fts           # query for "virginia AND congress": 3 Presidents
full text search of what? (^C to stop) virginia  congress
    searching virginia  congress in v_persons
    Thomas Jefferson
    James Madison Jr.
    James Monroe
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? fts           # query for "virginia AND national AND congress": 2 Presidents
full text search of what? (^C to stop) virginia national congress
    searching virginia national congress in v_persons
    Thomas Jefferson
    James Monroe

Listing of sqlite_fts.py

  1 # sqlite_fts.py
  2 #    VERY BASIC: how manage sqlite3 using python AND do Full Text Search on it
  3 #
  4 # author: luciano de falco alfano
  5 # license CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/
  6 # date:   2018.08.31
  7 #
  8 # overall logic:
  9 #    - main() open DB and execute an infinite loop to process user commands
 10 #    - cmd_exe() execute user command:
 11 #          - create_table() create a table with fixed record structure, and its
 12 #                            counterpart virtual table even with fixed structure
 13 #          - drop_table() drop a single table
 14 #          - insert()     insert some record
 15 #          - list()       list all records
 16 #          - fts()        full text search on records, list result keys
 17 #
 18 # excerpts from:
 19 #     - pure sqlite example: https://www.pythoncentral.io/introduction-to-sqlite-in-python/
 20 #                            https://www.pythoncentral.io/advanced-sqlite-usage-in-python/
 21 #     - fts3 example:        http://www.rkblog.rk.edu.pl/w/p/fulltext-search-sqlite-and-django-app/
 22 #     - fts3 doc:            https://www.sqlite.org/fts3.html
 23 
 24 import pdb
 25 import sqlite3
 26 
 27 DB = 'data/db.sqlite3'
 28 
 29 # dictionary of commands
 30 DoCMDS = {'c': 'create table',
 31           'd': 'drop table',
 32           'fts': 'full text search',
 33           'h': 'help',
 34           'i': 'insert persons',
 35           'l': 'list persons',
 36           'q': 'quit',
 37          }
 38 CMDS = tuple([cmd for cmd in DoCMDS.keys()])
 39 
 40 # the first 5 presidents of USA, profile incipit from wikipedia
 41 PERSONS = [ ('George Washington',
 42     '''George Washington (February 22, 1732[b][c] – December 14, 1799) was the first President of the United States, and was among the nation's Founding Fathers. As commander-in-chief of the Continental Army during the American Revolutionary War, he was the principal force in victory, effected with the surrender of the British at Yorktown. In 1787 he presided over the Constitutional Convention which formed the new federal government. Since the late 1780s, Washington has been known as the "Father of His Country" by compatriots. Scholarly and public polling ranks him among the top three Presidents in history.
 43 
 44 Washington was born to a moderately prosperous family of planters, who owned slaves in colonial Virginia. He had early opportunities in education, learned mathematics and quickly launched a successful career as a surveyor, which in turn enabled him to make considerable land investments. He then joined the Virginia militia and fought in the French and Indian War. During the war with the British he also fought against one of their allies, the Iroquois nation. His devotion to American Republicanism impelled him to decline further power after victory, and he resigned as commander-in-chief in 1783. ''', ),
 45     ('John Adams',
 46     '''John Adams (October 30 [O.S. October 19] 1735 – July 4, 1826) was an American statesman and Founding Father who served as the first Vice President (1789–1797) and second President of the United States (1797–1801). He was a lawyer, diplomat, political theorist, and a leader of the movement for American independence from Great Britain. He was also a dedicated diarist and correspondent, particularly with his wife and closest advisor, Abigail.
 47 
 48 Adams collaborated with his cousin, revolutionary leader Samuel Adams, but established his own prominence prior to the American Revolution. Driven by his devotion to the right to counsel and the presumption of innocence, he defied local anti-British sentiment and provided a successful legal defense of the accused British soldiers involved in the Boston Massacre. Adams was sent as a delegate from colonial Massachusetts to the Continental Congress, where he played a leading role in persuading Congress to declare independence. He assisted in drafting the Declaration of Independence in 1776, and was its foremost advocate in Congress. As a diplomat in Europe, he helped negotiate the peace treaty with Great Britain and acquired vital governmental loans from Amsterdam bankers. Adams was the primary author of the Massachusetts Constitution in 1780, which influenced the development of America's own constitution, as did his earlier Thoughts on Government. ''',),
 49     ('Thomas Jefferson',
 50     '''Thomas Jefferson (April 13, [O.S. April 2] 1743 – July 4, 1826) was an American Founding Father who was the principal author of the Declaration of Independence and later served as the third president of the United States from 1801 to 1809. Previously, he had been elected the second vice president of the United States, serving under John Adams from 1797 to 1801. He was a proponent of democracy, republicanism, and individual rights motivating American colonists to break from Great Britain and form a new nation; he produced formative documents and decisions at both the state and national level.
 51 
 52 Jefferson was mainly of English ancestry, born and educated in colonial Virginia. He graduated from the College of William & Mary and briefly practiced law, with the largest number of his cases concerning land ownership claims.[1] During the American Revolution, he represented Virginia in the Continental Congress that adopted the Declaration, drafted the law for religious freedom as a Virginia legislator, and served as a wartime governor (1779–1781). He became the United States Minister to France in May 1785, and subsequently the nation's first Secretary of State in 1790–1793 under President George Washington. Jefferson and James Madison organized the Democratic-Republican Party to oppose the Federalist Party during the formation of the First Party System. With Madison, he anonymously wrote the controversial Kentucky and Virginia Resolutions in 1798–1799, which sought to strengthen states' rights by nullifying the federal Alien and Sedition Acts. ''',),
 53     ('James Madison Jr.',
 54     '''James Madison Jr. (March 16, 1751 – June 28, 1836)[2] was an American statesman and Founding Father who served as the fourth President of the United States from 1809 to 1817. He is hailed as the "Father of the Constitution" for his pivotal role in drafting and promoting the United States Constitution and the Bill of Rights.
 55 
 56 Born into a prominent Virginia planting family, Madison served as a member of the Virginia House of Delegates and the Continental Congress during and after the American Revolutionary War. In the late 1780s, he helped organize the Constitutional Convention, which produced a new constitution to supplant the ineffective Articles of Confederation. After the Convention, Madison became one of the leaders in the movement to ratify the Constitution, and his collaboration with Alexander Hamilton produced The Federalist Papers, among the most important treatises in support of the Constitution. ''',),
 57     ('James Monroe',
 58     '''James Monroe (/mənˈroʊ/; April 28, 1758 – July 4, 1831) was an American statesman and Founding Father who served as the fifth President of the United States from 1817 to 1825. Monroe was the last president of the Virginia dynasty, and his presidency ushered in what is known as the Era of Good Feelings. Born in Westmoreland County, Virginia, Monroe was of the planter class and fought in the American Revolutionary War. He was wounded in the Battle of Trenton with a musket ball to the shoulder. After studying law under Thomas Jefferson from 1780 to 1783, he served as a delegate in the Continental Congress.[1]
 59 
 60 As an anti-federalist delegate to the Virginia convention that considered ratification of the United States Constitution, Monroe opposed ratification, claiming it gave too much power to the central government. Nonetheless, he took an active part in the new government, and in 1790 was elected to the Senate of the first United States Congress, where he joined the Democratic-Republicans. He gained experience as an executive as the Governor of Virginia and rose to national prominence as a diplomat in France when he helped negotiate the Louisiana Purchase in 1803. During the War of 1812, Monroe served in critical roles as Secretary of State and the Secretary of War under President James Madison.[2] ''',), ]
 61 
 62 TABLE = ''
 63 
 64 def insert(db, tname, persons):
 65     '''insert a bunch of persons in a (empty) table'''
 66 
 67     try:
 68         cursor = db.cursor()
 69         for person in persons:
 70             cursor.execute(f'INSERT INTO {tname}(name, profile) VALUES(?,?)', person)
 71             cursor.execute(f'INSERT INTO v_{tname}(name, profile) VALUES(?,?)', person)
 72         db.commit()
 73     except Exception as ex:
 74         db.rollback()
 75         raise ex
 76 
 77 
 78 def list(db, tname):
 79     '''list of all persons in table'''
 80 
 81     try:
 82         cursor = db.cursor()
 83         cursor.execute(f'SELECT name, profile FROM {tname}')
 84         for row in cursor:
 85             print(f'    {row[0]}: {row[1]}')
 86     except Exception as ex:
 87         raise ex
 88 
 89 
 90 def fts(db, tname, what):
 91     '''full text search of something in virtual table v_{tname}'''
 92 
 93     try:
 94         cursor = db.cursor()
 95         cursor.execute(f'SELECT name FROM v_{tname} WHERE profile MATCH ?', (what,))
 96         for row in cursor:
 97             print(f'    {row[0]}')
 98     except Exception as ex:
 99         raise ex
100 
101 
102 def create_table(db, tname):
103     '''create a table with id/name/profile record
104 
105        note: create even a virtual table named v_{tname}'''
106 
107     try:
108         cursor = db.cursor()
109         cursor.execute(f'CREATE TABLE IF NOT EXISTS {tname}(\
110             id INTEGER PRIMARY KEY, \
111             name TEXT, \
112             profile TEXT)')
113         cursor.execute(f'CREATE VIRTUAL TABLE v_{tname} USING fts3(name, profile)')
114         db.commit()
115     except Exception as ex:
116         db.rollback()
117         raise ex
118 
119 
120 def drop_table(db, tname):
121     '''drop A SINGLE table
122 
123     note: tname and v_{tname} must be dropped separately
124       this is because in case of partial drop (one of two tables)
125       program could become stuck'''
126 
127     try:
128         cursor = db.cursor()
129         cursor.execute(f'DROP TABLE {tname}')
130         db.commit()
131     except Exception as ex:
132         db.rollback()
133         raise ex
134 
135 def help():
136     '''list commands'''
137 
138     for cmd in CMDS:
139         print(f'    {cmd}: {DoCMDS[cmd]}')
140     print()
141 
142 def cmd_exe(db, cmd):
143     ''' execute command, BUT not 'quit' '''
144 
145     global TABLE
146     try:
147         #pdb.set_trace()
148         if not cmd in CMDS:
149             raise ValueError(f'ERROR: cmd {cmd} unknown')
150         elif cmd == 'c': # create table
151             tname = input(f'table name to create? (^C to stop) ')
152             print(f'    creating table(s) {tname} and v_{tname}')
153             create_table(db, tname)
154             TABLE = tname[:]
155         elif cmd == 'd': # drop table
156             tname = input(f'table name to drop? (^C to stop) ')
157             print(f'    deleting table {tname}')
158             drop_table(db, tname)
159         elif cmd == 'fts': # full text search
160             what = input(f'full text search of what? (^C to stop) ')
161             print(f'    searching {what} in v_{TABLE}')
162             fts(db, TABLE, what)
163         elif cmd == 'i': # insert persons
164             print(f'    inserting persons')
165             insert(db, TABLE, PERSONS)
166         elif cmd == 'l': # list persons
167             print(f'    listing persons')
168             list(db, TABLE)
169         else:
170             raise ValueError(f'WARNING: cmd {cmd} unsupported ')
171     except Exception as ex:
172         raise ex
173 
174 def get_table(db):
175     '''return 1st table name not starting with v_ prefix'''
176 
177     try:
178         cursor = db.cursor()
179         cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
180         tables = []
181         for row in cursor:
182             if row[0].split('_')[0]=='v':
183                 continue
184             else:
185                 return row[0]
186     except Exception as ex:
187         raise ex
188 
189 
190 def main():
191     global TABLE
192     # Creates or opens a SQLite3 DB
193     with sqlite3.connect( DB ) as db:
194         TABLE = get_table(db)
195         print(f'{ DB } bd opened')
196         if TABLE:
197             print(f'{TABLE} table exists')
198 
199         #command loop
200         while True:
201             try:
202                 cmd = input(f'{ CMDS } cmd? ')
203                 if not cmd in CMDS:
204                     print(f'cmd { cmd } unknown; h for help')
205                     continue
206                 elif cmd == 'q': # quit
207                     print(f'quitting')
208                     break
209                 else:
210                     cmd_exe(db, cmd)
211             except Exception as ex:
212                 raise
213 
214 
215 if __name__=='__main__':
216     main()

This code is downloadable from github.

Warnings

What else to do in order to use full text search with Python and SQLite 3? Provided we use an updated version of Python, there aren't other needs: all included in Python batteries! :-)

Attention to the fact that virtual tables are powered with pure text. This means that if we use a markup language, as is normally the case, we have to delete it before passing the contents of the record to the virtual table. If we use html directly, it is possible to entrust the task to Python library beautifulsoup4:

from BeautifulSoup import BeautifulSoup

html = '<div> <p>my content</p></div>'
text = ''.join(BeautifulSoup(html).findAll(text=True))

If we use another markup language, check if it's possible convert it to and then to use beautifulsoup4 [6]:

from BeautifulSoup import BeautifulSoup
from markdown import markdown

html = markdown(some_markdown_string)
text = ''.join(BeautifulSoup(html).findAll(text=True))

Almost I forgot: what about documentation? Here we find the FTS3 e FTS4 documentation; while here we find the FTS5 documentation.

And then?. And then, with this article I touched only the surface of the problem. Limiting myself to observing how to create, in a simple way, the virtual table of search support, and how to insert, from one source only, the records to be searched. But it is necessary deepen various aspects. For example: how does commit behave? How to make changes or cancellations of records? What happens in case of concurrent requests? ...

Conclusions

An environment worth exploring in depth. Moreover also because it allows to have data and indexes enclosed in a single file, in addition to being a library that requires few resources.

The obvious disadvantage is the fact that it binds the application to the DBMS: SQLite 3. Which usually isn't appreciated by those who develop in Django. One of the strengths of Django is the fact that its data modeling language, used with a minimum of attention, make the application DBMS unaware, allowing to use it with SQLite rather than PostgreSQL rather than MySQL (and others).

Enjoy by ldfa


[1]Data Base Management System.
[2]Full Text Search.
[3]I read previous versions of SQLite 3 on MS Windows didn't support FTS3. It was necessary to compile SQLite 3 to obtain this feature under Windows.
[4]In the sense that if the table is called people, the virtual table is will call v_persone (plus other support tables managed by SQLite 3, which we do not need to explicitly manage).
[5]Opening the DB, if SQLite 3 doesn't find it, it creates a new empty instance.
[6]As suggested in stackoverflow.