Created on 31 Aug 2018 ;    Modified on 19 Sep 2018 ;    Translationenglish

La ricerca a testo libero con Python - N.1

Cominciando ad indagare su come implementare la ricerca a testo libero in una applicazione WEB based sviluppata con Python/Django, ho letto che le versioni moderne di SQLite 3, la libreria DBMS [1] distribuita con Python, la supportano.

Lo fanno utilizzando una libreria di supporto FTS [2], da usare nelle versioni 3 e successive (FTS3, FTS4 e FTS5, al momento in cui scrivo; le versioni 1 e 2 sono da evitare perché buggate).

Inoltre, sempre al momento in cui scrivo, almeno FTS3 è supportata da SQLite 3 distribuito con Python 3.6 su sistema MS Windows 10 (64 bit) [3].

Siccome lavoro spesso con SQLite su Python tramite Django, ho deciso di provare questa feature per verificare la modalità d'uso. Ed ho osservato che è decisamente semplice.

Ad esempio, senza farci spaventare dalla lunghezza, prendiamo il programma sqlite_fts.py di cui riporto il listing in coda all'articolo: Listing di sqlite_fts.py.

La funzione create_table(db, tname) (linee 102-117) crea due tabelle con struttura fissa (per semplicità di codifica):

  • la tabella {tname} il cui record ha i campi id/name/profile; di cui usiamo name come chiave; le ricerche le faremo su profile;
  • una tabella virtuale FTS3 di nome v_{tname} [4] il cui record ha i campi name/profile; il primo lo useremo come collegamento al record nella tabella {tname}, il secondo lo alimenteremo con il testo che sarà oggetto della ricerca a testo libero.

La tabella v_{tname} è speciale: alimenta degli appositi indici gestiti da SQLite 3/FTS3 per effettuare le ricerche a testo libero. Il suo scopo primario non è quello di registrare le informazioni, ma solo di alimentare gli indici. Se necessario è possibile utilizzarla anche per registrare le informazioni. Cosa che nel ns programma di prova facciano in {tname}.

La funzione insert(db, tname, persons) (linee 64-75) inserisce alcune persone nel DB. In particolare esegue due inserimenti:

  • alla linea 70 effettua l'inserimento nella tabella {tname}, quella fatta per registrare i dati;
  • alla linea 71 inserisce gli stessi dati nella tabella virtuale v_{tname}, quella che deve alimentare gli indici per la ricerca a testo libero.

La funzione fts(db, tname, what) (linee 90-99) esemplifica la ricerca a testo libero, fatta con una SELECT sulla tabella virtuale v_{tname} che utilizza l'operatore MATCH. La linea 95 chiede di matchare la stringa what sul campo profile. La stringa what può essere una parola, come una frase.

Il resto del codice é routine. In particolare il main() esegue l'apertura del DB [5] alla linea 193, e alle linee 200-212 esegue un ciclo infinito per eseguire interattivamente i comandi dell'operatore.

L'esecuzione dei comandi è affidata alla funzione cmd_exe(db, cmd) (linee 142-172).

Le altre funzioni sono di supporto: drop delle tabelle, list dei record, e così via.

I dati inseriti per prova nelle tabelle sono embedded nella variabile PERSONS (linee 41-60 del sorgente).

Un esempio di sessione di prova

Facciamo un esempio di sessione di prova.

Supponiamo di avere creato il programma sqlite_fts.py in una directory temporanea di prova: test. Eseguiamo i seguenti passi:

>cd test                               # go to test directory
>mkdir data                            # create data dir for DB
>python sqlite_fts.py                  # execute program
data/db.sqlite3 bd opened
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? c             # create tables persons
table name to create? (^C to stop) persons
    creating table(s) persons and v_persons
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? i             # insert 5 records in tables (first 5 USA Presidents)
    inserting persons
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? l             # list persons (output cutted)
    listing persons
    George Washington: George Washington (February 22, 1732[b][c] – December 14, 1799) was the first President ...
    ... CUT ...
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? fts           # query for "virginia": 4 Presidents
full text search of what? (^C to stop) virginia
    searching virginia in v_persons
    George Washington
    Thomas Jefferson
    James Madison Jr.
    James Monroe
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? fts           # query for "virginia AND congress": 3 Presidents
full text search of what? (^C to stop) virginia  congress
    searching virginia  congress in v_persons
    Thomas Jefferson
    James Madison Jr.
    James Monroe
('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? fts           # query for "virginia AND national AND congress": 2 Presidents
full text search of what? (^C to stop) virginia national congress
    searching virginia national congress in v_persons
    Thomas Jefferson
    James Monroe

Listing di sqlite_fts.py

  1 # sqlite_fts.py
  2 #    VERY BASIC: how manage sqlite3 using python AND do Full Text Search on it
  3 #
  4 # author: luciano de falco alfano
  5 # license CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/
  6 # date:   2018.08.31
  7 #
  8 # overall logic:
  9 #    - main() open DB and execute an infinite loop to process user commands
 10 #    - cmd_exe() execute user command:
 11 #          - create_table() create a table with fixed record structure, and its
 12 #                            counterpart virtual table even with fixed structure
 13 #          - drop_table() drop a single table
 14 #          - insert()     insert some record
 15 #          - list()       list all records
 16 #          - fts()        full text search on records, list result keys
 17 #
 18 # excerpts from:
 19 #     - pure sqlite example: https://www.pythoncentral.io/introduction-to-sqlite-in-python/
 20 #                            https://www.pythoncentral.io/advanced-sqlite-usage-in-python/
 21 #     - fts3 example:        http://www.rkblog.rk.edu.pl/w/p/fulltext-search-sqlite-and-django-app/
 22 #     - fts3 doc:            https://www.sqlite.org/fts3.html
 23 
 24 import pdb
 25 import sqlite3
 26 
 27 DB = 'data/db.sqlite3'
 28 
 29 # dictionary of commands
 30 DoCMDS = {'c': 'create table',
 31           'd': 'drop table',
 32           'fts': 'full text search',
 33           'h': 'help',
 34           'i': 'insert persons',
 35           'l': 'list persons',
 36           'q': 'quit',
 37          }
 38 CMDS = tuple([cmd for cmd in DoCMDS.keys()])
 39 
 40 # the first 5 presidents of USA, profile incipit from wikipedia
 41 PERSONS = [ ('George Washington',
 42     '''George Washington (February 22, 1732[b][c] – December 14, 1799) was the first President of the United States, and was among the nation's Founding Fathers. As commander-in-chief of the Continental Army during the American Revolutionary War, he was the principal force in victory, effected with the surrender of the British at Yorktown. In 1787 he presided over the Constitutional Convention which formed the new federal government. Since the late 1780s, Washington has been known as the "Father of His Country" by compatriots. Scholarly and public polling ranks him among the top three Presidents in history.
 43 
 44 Washington was born to a moderately prosperous family of planters, who owned slaves in colonial Virginia. He had early opportunities in education, learned mathematics and quickly launched a successful career as a surveyor, which in turn enabled him to make considerable land investments. He then joined the Virginia militia and fought in the French and Indian War. During the war with the British he also fought against one of their allies, the Iroquois nation. His devotion to American Republicanism impelled him to decline further power after victory, and he resigned as commander-in-chief in 1783. ''', ),
 45     ('John Adams',
 46     '''John Adams (October 30 [O.S. October 19] 1735 – July 4, 1826) was an American statesman and Founding Father who served as the first Vice President (1789–1797) and second President of the United States (1797–1801). He was a lawyer, diplomat, political theorist, and a leader of the movement for American independence from Great Britain. He was also a dedicated diarist and correspondent, particularly with his wife and closest advisor, Abigail.
 47 
 48 Adams collaborated with his cousin, revolutionary leader Samuel Adams, but established his own prominence prior to the American Revolution. Driven by his devotion to the right to counsel and the presumption of innocence, he defied local anti-British sentiment and provided a successful legal defense of the accused British soldiers involved in the Boston Massacre. Adams was sent as a delegate from colonial Massachusetts to the Continental Congress, where he played a leading role in persuading Congress to declare independence. He assisted in drafting the Declaration of Independence in 1776, and was its foremost advocate in Congress. As a diplomat in Europe, he helped negotiate the peace treaty with Great Britain and acquired vital governmental loans from Amsterdam bankers. Adams was the primary author of the Massachusetts Constitution in 1780, which influenced the development of America's own constitution, as did his earlier Thoughts on Government. ''',),
 49     ('Thomas Jefferson',
 50     '''Thomas Jefferson (April 13, [O.S. April 2] 1743 – July 4, 1826) was an American Founding Father who was the principal author of the Declaration of Independence and later served as the third president of the United States from 1801 to 1809. Previously, he had been elected the second vice president of the United States, serving under John Adams from 1797 to 1801. He was a proponent of democracy, republicanism, and individual rights motivating American colonists to break from Great Britain and form a new nation; he produced formative documents and decisions at both the state and national level.
 51 
 52 Jefferson was mainly of English ancestry, born and educated in colonial Virginia. He graduated from the College of William & Mary and briefly practiced law, with the largest number of his cases concerning land ownership claims.[1] During the American Revolution, he represented Virginia in the Continental Congress that adopted the Declaration, drafted the law for religious freedom as a Virginia legislator, and served as a wartime governor (1779–1781). He became the United States Minister to France in May 1785, and subsequently the nation's first Secretary of State in 1790–1793 under President George Washington. Jefferson and James Madison organized the Democratic-Republican Party to oppose the Federalist Party during the formation of the First Party System. With Madison, he anonymously wrote the controversial Kentucky and Virginia Resolutions in 1798–1799, which sought to strengthen states' rights by nullifying the federal Alien and Sedition Acts. ''',),
 53     ('James Madison Jr.',
 54     '''James Madison Jr. (March 16, 1751 – June 28, 1836)[2] was an American statesman and Founding Father who served as the fourth President of the United States from 1809 to 1817. He is hailed as the "Father of the Constitution" for his pivotal role in drafting and promoting the United States Constitution and the Bill of Rights.
 55 
 56 Born into a prominent Virginia planting family, Madison served as a member of the Virginia House of Delegates and the Continental Congress during and after the American Revolutionary War. In the late 1780s, he helped organize the Constitutional Convention, which produced a new constitution to supplant the ineffective Articles of Confederation. After the Convention, Madison became one of the leaders in the movement to ratify the Constitution, and his collaboration with Alexander Hamilton produced The Federalist Papers, among the most important treatises in support of the Constitution. ''',),
 57     ('James Monroe',
 58     '''James Monroe (/mənˈroʊ/; April 28, 1758 – July 4, 1831) was an American statesman and Founding Father who served as the fifth President of the United States from 1817 to 1825. Monroe was the last president of the Virginia dynasty, and his presidency ushered in what is known as the Era of Good Feelings. Born in Westmoreland County, Virginia, Monroe was of the planter class and fought in the American Revolutionary War. He was wounded in the Battle of Trenton with a musket ball to the shoulder. After studying law under Thomas Jefferson from 1780 to 1783, he served as a delegate in the Continental Congress.[1]
 59 
 60 As an anti-federalist delegate to the Virginia convention that considered ratification of the United States Constitution, Monroe opposed ratification, claiming it gave too much power to the central government. Nonetheless, he took an active part in the new government, and in 1790 was elected to the Senate of the first United States Congress, where he joined the Democratic-Republicans. He gained experience as an executive as the Governor of Virginia and rose to national prominence as a diplomat in France when he helped negotiate the Louisiana Purchase in 1803. During the War of 1812, Monroe served in critical roles as Secretary of State and the Secretary of War under President James Madison.[2] ''',), ]
 61 
 62 TABLE = ''
 63 
 64 def insert(db, tname, persons):
 65     '''insert a bunch of persons in a (empty) table'''
 66 
 67     try:
 68         cursor = db.cursor()
 69         for person in persons:
 70             cursor.execute(f'INSERT INTO {tname}(name, profile) VALUES(?,?)', person)
 71             cursor.execute(f'INSERT INTO v_{tname}(name, profile) VALUES(?,?)', person)
 72         db.commit()
 73     except Exception as ex:
 74         db.rollback()
 75         raise ex
 76 
 77 
 78 def list(db, tname):
 79     '''list of all persons in table'''
 80 
 81     try:
 82         cursor = db.cursor()
 83         cursor.execute(f'SELECT name, profile FROM {tname}')
 84         for row in cursor:
 85             print(f'    {row[0]}: {row[1]}')
 86     except Exception as ex:
 87         raise ex
 88 
 89 
 90 def fts(db, tname, what):
 91     '''full text search of something in virtual table v_{tname}'''
 92 
 93     try:
 94         cursor = db.cursor()
 95         cursor.execute(f'SELECT name FROM v_{tname} WHERE profile MATCH ?', (what,))
 96         for row in cursor:
 97             print(f'    {row[0]}')
 98     except Exception as ex:
 99         raise ex
100 
101 
102 def create_table(db, tname):
103     '''create a table with id/name/profile record
104 
105        note: create even a virtual table named v_{tname}'''
106 
107     try:
108         cursor = db.cursor()
109         cursor.execute(f'CREATE TABLE IF NOT EXISTS {tname}(\
110             id INTEGER PRIMARY KEY, \
111             name TEXT, \
112             profile TEXT)')
113         cursor.execute(f'CREATE VIRTUAL TABLE v_{tname} USING fts3(name, profile)')
114         db.commit()
115     except Exception as ex:
116         db.rollback()
117         raise ex
118 
119 
120 def drop_table(db, tname):
121     '''drop A SINGLE table
122 
123     note: tname and v_{tname} must be dropped separately
124       this is because in case of partial drop (one of two tables)
125       program could become stuck'''
126 
127     try:
128         cursor = db.cursor()
129         cursor.execute(f'DROP TABLE {tname}')
130         db.commit()
131     except Exception as ex:
132         db.rollback()
133         raise ex
134 
135 def help():
136     '''list commands'''
137 
138     for cmd in CMDS:
139         print(f'    {cmd}: {DoCMDS[cmd]}')
140     print()
141 
142 def cmd_exe(db, cmd):
143     ''' execute command, BUT not 'quit' '''
144 
145     global TABLE
146     try:
147         #pdb.set_trace()
148         if not cmd in CMDS:
149             raise ValueError(f'ERROR: cmd {cmd} unknown')
150         elif cmd == 'c': # create table
151             tname = input(f'table name to create? (^C to stop) ')
152             print(f'    creating table(s) {tname} and v_{tname}')
153             create_table(db, tname)
154             TABLE = tname[:]
155         elif cmd == 'd': # drop table
156             tname = input(f'table name to drop? (^C to stop) ')
157             print(f'    deleting table {tname}')
158             drop_table(db, tname)
159         elif cmd == 'fts': # full text search
160             what = input(f'full text search of what? (^C to stop) ')
161             print(f'    searching {what} in v_{TABLE}')
162             fts(db, TABLE, what)
163         elif cmd == 'i': # insert persons
164             print(f'    inserting persons')
165             insert(db, TABLE, PERSONS)
166         elif cmd == 'l': # list persons
167             print(f'    listing persons')
168             list(db, TABLE)
169         else:
170             raise ValueError(f'WARNING: cmd {cmd} unsupported ')
171     except Exception as ex:
172         raise ex
173 
174 def get_table(db):
175     '''return 1st table name not starting with v_ prefix'''
176 
177     try:
178         cursor = db.cursor()
179         cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
180         tables = []
181         for row in cursor:
182             if row[0].split('_')[0]=='v':
183                 continue
184             else:
185                 return row[0]
186     except Exception as ex:
187         raise ex
188 
189 
190 def main():
191     global TABLE
192     # Creates or opens a SQLite3 DB
193     with sqlite3.connect( DB ) as db:
194         TABLE = get_table(db)
195         print(f'{ DB } bd opened')
196         if TABLE:
197             print(f'{TABLE} table exists')
198 
199         #command loop
200         while True:
201             try:
202                 cmd = input(f'{ CMDS } cmd? ')
203                 if not cmd in CMDS:
204                     print(f'cmd { cmd } unknown; h for help')
205                     continue
206                 elif cmd == 'q': # quit
207                     print(f'quitting')
208                     break
209                 else:
210                     cmd_exe(db, cmd)
211             except Exception as ex:
212                 raise
213 
214 
215 if __name__=='__main__':
216     main()

Questo codice è scaricabile da github.

Avvertenze

Che altro c'e da fare per poter utilizzare la ricerca a testo libero con Python e SQLite 3? A patto di usare una versione aggiornata di Python, non c'è bisogno d'altro: tutto incluso nelle batterie di Python! :-)

Attenzione al fatto che le tabelle virtuali vanno alimentate con testo puro. Questo significa che se utilizziamo un linguaggio di markup, come normalmente é, dobbiamo eliminarlo prima di passare il contenuto del record alla tabella virtuale. Se usiamo direttamente html, è possibile affidare l'incombenza alla libreria Python beautifulsoup4:

from BeautifulSoup import BeautifulSoup

html = '<div> <p>my content</p></div>'
text = ''.join(BeautifulSoup(html).findAll(text=True))

Se proveniamo da un altro linguaggio di markup, verifichiamo se é possibile trasformarlo in html e poi usare beautifulsoup4 [6]:

from BeautifulSoup import BeautifulSoup
from markdown import markdown

html = markdown(some_markdown_string)
text = ''.join(BeautifulSoup(html).findAll(text=True))

Dimenticavo, e la documentazione? Qui troviamo quella relativa a FTS3 e FTS4, mentre qui troviamo quella relativa a FTS5.

E poi?. E poi, con questo articolo ho sfiorato la superficie del problema. Limitandomi ad osservare come creare, in modo semplice, la tabella virtuale di supporto alla ricerca, e come inserire, da una sola sorgente, i record da sottoporre a ricerca. Ma è necessario approfondire vari aspetti. Ad esempio: la commit come si comporta? Come fare modifiche o cancellazioni di record? Che succede in caso di richieste concorrenti? ...

Conclusioni

Un ambiente di cui vale la pena approfondire gli aspetti. Tra l'altro anche perché permette di avere dati e indici racchiusi in un'unico file, oltre al fatto di essere una libreria che richiede poche risorse.

Lo svantaggio evidente è il fatto che lega l'applicazione al DBMS: SQLite 3. Cosa solitamente non gradita a chi sviluppa in Django. Uno dei punti di forza di questo ambiente è il fatto che, usando con un minimo di attenzione il suo linguaggio di modellazione dei dati, l'applicazione sviluppata può essere facilmente installata utilizzando DBMS diversi: SQLite piuttosto che PostgreSQL piuttosto che MySQL (ed altri).

Enjoy by ldfa


[1]Data Base Management System.
[2]Full Text Search.
[3]Ho letto che versioni prededenti di SQLite 3 su MS Windows non supportavano FTS3. Era necessario compilare SQLite 3 ad hoc per avere questa feature in Windows.
[4]Nel senso che se la ns tabella si chiama persone, la tabella virtuale si chiamerà v_persone (più altre tabelle di appoggio gestite da SQLite 3, che noi non abbiamo necessità di nominare esplicitamente).
[5]All'apertura del DB, se SQLite 3 non lo trova, ne crea una istanza vuota.
[6]Come suggerito in stackoverflow.