Created on 31 Aug 2018 ; Modified on 19 Sep 2018 ; Translation: english
Cominciando ad indagare su come implementare la ricerca a testo libero in una applicazione WEB based sviluppata con Python/Django, ho letto che le versioni moderne di SQLite 3, la libreria DBMS [1] distribuita con Python, la supportano.
Lo fanno utilizzando una libreria di supporto FTS [2], da usare nelle versioni 3 e successive (FTS3, FTS4 e FTS5, al momento in cui scrivo; le versioni 1 e 2 sono da evitare perché buggate).
Inoltre, sempre al momento in cui scrivo, almeno FTS3 è supportata da SQLite 3 distribuito con Python 3.6 su sistema MS Windows 10 (64 bit) [3].
Siccome lavoro spesso con SQLite su Python tramite Django, ho deciso di provare questa feature per verificare la modalità d'uso. Ed ho osservato che è decisamente semplice.
Ad esempio, senza farci spaventare dalla lunghezza, prendiamo il programma sqlite_fts.py di cui riporto il listing in coda all'articolo: Listing di sqlite_fts.py.
La funzione create_table(db, tname) (linee 102-117) crea due tabelle con struttura fissa (per semplicità di codifica):
La tabella v_{tname} è speciale: alimenta degli appositi indici gestiti da SQLite 3/FTS3 per effettuare le ricerche a testo libero. Il suo scopo primario non è quello di registrare le informazioni, ma solo di alimentare gli indici. Se necessario è possibile utilizzarla anche per registrare le informazioni. Cosa che nel ns programma di prova facciano in {tname}.
La funzione insert(db, tname, persons) (linee 64-75) inserisce alcune persone nel DB. In particolare esegue due inserimenti:
La funzione fts(db, tname, what) (linee 90-99) esemplifica la ricerca a testo libero, fatta con una SELECT sulla tabella virtuale v_{tname} che utilizza l'operatore MATCH. La linea 95 chiede di matchare la stringa what sul campo profile. La stringa what può essere una parola, come una frase.
Il resto del codice é routine. In particolare il main() esegue l'apertura del DB [5] alla linea 193, e alle linee 200-212 esegue un ciclo infinito per eseguire interattivamente i comandi dell'operatore.
L'esecuzione dei comandi è affidata alla funzione cmd_exe(db, cmd) (linee 142-172).
Le altre funzioni sono di supporto: drop delle tabelle, list dei record, e così via.
I dati inseriti per prova nelle tabelle sono embedded nella variabile PERSONS (linee 41-60 del sorgente).
Facciamo un esempio di sessione di prova.
Supponiamo di avere creato il programma sqlite_fts.py in una directory temporanea di prova: test. Eseguiamo i seguenti passi:
>cd test # go to test directory >mkdir data # create data dir for DB >python sqlite_fts.py # execute program data/db.sqlite3 bd opened ('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? c # create tables persons table name to create? (^C to stop) persons creating table(s) persons and v_persons ('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? i # insert 5 records in tables (first 5 USA Presidents) inserting persons ('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? l # list persons (output cutted) listing persons George Washington: George Washington (February 22, 1732[b][c] – December 14, 1799) was the first President ... ... CUT ... ('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? fts # query for "virginia": 4 Presidents full text search of what? (^C to stop) virginia searching virginia in v_persons George Washington Thomas Jefferson James Madison Jr. James Monroe ('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? fts # query for "virginia AND congress": 3 Presidents full text search of what? (^C to stop) virginia congress searching virginia congress in v_persons Thomas Jefferson James Madison Jr. James Monroe ('c', 'd', 'fts', 'h', 'i', 'l', 'q') cmd? fts # query for "virginia AND national AND congress": 2 Presidents full text search of what? (^C to stop) virginia national congress searching virginia national congress in v_persons Thomas Jefferson James Monroe
1 # sqlite_fts.py 2 # VERY BASIC: how manage sqlite3 using python AND do Full Text Search on it 3 # 4 # author: luciano de falco alfano 5 # license CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/ 6 # date: 2018.08.31 7 # 8 # overall logic: 9 # - main() open DB and execute an infinite loop to process user commands 10 # - cmd_exe() execute user command: 11 # - create_table() create a table with fixed record structure, and its 12 # counterpart virtual table even with fixed structure 13 # - drop_table() drop a single table 14 # - insert() insert some record 15 # - list() list all records 16 # - fts() full text search on records, list result keys 17 # 18 # excerpts from: 19 # - pure sqlite example: https://www.pythoncentral.io/introduction-to-sqlite-in-python/ 20 # https://www.pythoncentral.io/advanced-sqlite-usage-in-python/ 21 # - fts3 example: http://www.rkblog.rk.edu.pl/w/p/fulltext-search-sqlite-and-django-app/ 22 # - fts3 doc: https://www.sqlite.org/fts3.html 23 24 import pdb 25 import sqlite3 26 27 DB = 'data/db.sqlite3' 28 29 # dictionary of commands 30 DoCMDS = {'c': 'create table', 31 'd': 'drop table', 32 'fts': 'full text search', 33 'h': 'help', 34 'i': 'insert persons', 35 'l': 'list persons', 36 'q': 'quit', 37 } 38 CMDS = tuple([cmd for cmd in DoCMDS.keys()]) 39 40 # the first 5 presidents of USA, profile incipit from wikipedia 41 PERSONS = [ ('George Washington', 42 '''George Washington (February 22, 1732[b][c] – December 14, 1799) was the first President of the United States, and was among the nation's Founding Fathers. As commander-in-chief of the Continental Army during the American Revolutionary War, he was the principal force in victory, effected with the surrender of the British at Yorktown. In 1787 he presided over the Constitutional Convention which formed the new federal government. Since the late 1780s, Washington has been known as the "Father of His Country" by compatriots. Scholarly and public polling ranks him among the top three Presidents in history. 43 44 Washington was born to a moderately prosperous family of planters, who owned slaves in colonial Virginia. He had early opportunities in education, learned mathematics and quickly launched a successful career as a surveyor, which in turn enabled him to make considerable land investments. He then joined the Virginia militia and fought in the French and Indian War. During the war with the British he also fought against one of their allies, the Iroquois nation. His devotion to American Republicanism impelled him to decline further power after victory, and he resigned as commander-in-chief in 1783. ''', ), 45 ('John Adams', 46 '''John Adams (October 30 [O.S. October 19] 1735 – July 4, 1826) was an American statesman and Founding Father who served as the first Vice President (1789–1797) and second President of the United States (1797–1801). He was a lawyer, diplomat, political theorist, and a leader of the movement for American independence from Great Britain. He was also a dedicated diarist and correspondent, particularly with his wife and closest advisor, Abigail. 47 48 Adams collaborated with his cousin, revolutionary leader Samuel Adams, but established his own prominence prior to the American Revolution. Driven by his devotion to the right to counsel and the presumption of innocence, he defied local anti-British sentiment and provided a successful legal defense of the accused British soldiers involved in the Boston Massacre. Adams was sent as a delegate from colonial Massachusetts to the Continental Congress, where he played a leading role in persuading Congress to declare independence. He assisted in drafting the Declaration of Independence in 1776, and was its foremost advocate in Congress. As a diplomat in Europe, he helped negotiate the peace treaty with Great Britain and acquired vital governmental loans from Amsterdam bankers. Adams was the primary author of the Massachusetts Constitution in 1780, which influenced the development of America's own constitution, as did his earlier Thoughts on Government. ''',), 49 ('Thomas Jefferson', 50 '''Thomas Jefferson (April 13, [O.S. April 2] 1743 – July 4, 1826) was an American Founding Father who was the principal author of the Declaration of Independence and later served as the third president of the United States from 1801 to 1809. Previously, he had been elected the second vice president of the United States, serving under John Adams from 1797 to 1801. He was a proponent of democracy, republicanism, and individual rights motivating American colonists to break from Great Britain and form a new nation; he produced formative documents and decisions at both the state and national level. 51 52 Jefferson was mainly of English ancestry, born and educated in colonial Virginia. He graduated from the College of William & Mary and briefly practiced law, with the largest number of his cases concerning land ownership claims.[1] During the American Revolution, he represented Virginia in the Continental Congress that adopted the Declaration, drafted the law for religious freedom as a Virginia legislator, and served as a wartime governor (1779–1781). He became the United States Minister to France in May 1785, and subsequently the nation's first Secretary of State in 1790–1793 under President George Washington. Jefferson and James Madison organized the Democratic-Republican Party to oppose the Federalist Party during the formation of the First Party System. With Madison, he anonymously wrote the controversial Kentucky and Virginia Resolutions in 1798–1799, which sought to strengthen states' rights by nullifying the federal Alien and Sedition Acts. ''',), 53 ('James Madison Jr.', 54 '''James Madison Jr. (March 16, 1751 – June 28, 1836)[2] was an American statesman and Founding Father who served as the fourth President of the United States from 1809 to 1817. He is hailed as the "Father of the Constitution" for his pivotal role in drafting and promoting the United States Constitution and the Bill of Rights. 55 56 Born into a prominent Virginia planting family, Madison served as a member of the Virginia House of Delegates and the Continental Congress during and after the American Revolutionary War. In the late 1780s, he helped organize the Constitutional Convention, which produced a new constitution to supplant the ineffective Articles of Confederation. After the Convention, Madison became one of the leaders in the movement to ratify the Constitution, and his collaboration with Alexander Hamilton produced The Federalist Papers, among the most important treatises in support of the Constitution. ''',), 57 ('James Monroe', 58 '''James Monroe (/mənˈroʊ/; April 28, 1758 – July 4, 1831) was an American statesman and Founding Father who served as the fifth President of the United States from 1817 to 1825. Monroe was the last president of the Virginia dynasty, and his presidency ushered in what is known as the Era of Good Feelings. Born in Westmoreland County, Virginia, Monroe was of the planter class and fought in the American Revolutionary War. He was wounded in the Battle of Trenton with a musket ball to the shoulder. After studying law under Thomas Jefferson from 1780 to 1783, he served as a delegate in the Continental Congress.[1] 59 60 As an anti-federalist delegate to the Virginia convention that considered ratification of the United States Constitution, Monroe opposed ratification, claiming it gave too much power to the central government. Nonetheless, he took an active part in the new government, and in 1790 was elected to the Senate of the first United States Congress, where he joined the Democratic-Republicans. He gained experience as an executive as the Governor of Virginia and rose to national prominence as a diplomat in France when he helped negotiate the Louisiana Purchase in 1803. During the War of 1812, Monroe served in critical roles as Secretary of State and the Secretary of War under President James Madison.[2] ''',), ] 61 62 TABLE = '' 63 64 def insert(db, tname, persons): 65 '''insert a bunch of persons in a (empty) table''' 66 67 try: 68 cursor = db.cursor() 69 for person in persons: 70 cursor.execute(f'INSERT INTO {tname}(name, profile) VALUES(?,?)', person) 71 cursor.execute(f'INSERT INTO v_{tname}(name, profile) VALUES(?,?)', person) 72 db.commit() 73 except Exception as ex: 74 db.rollback() 75 raise ex 76 77 78 def list(db, tname): 79 '''list of all persons in table''' 80 81 try: 82 cursor = db.cursor() 83 cursor.execute(f'SELECT name, profile FROM {tname}') 84 for row in cursor: 85 print(f' {row[0]}: {row[1]}') 86 except Exception as ex: 87 raise ex 88 89 90 def fts(db, tname, what): 91 '''full text search of something in virtual table v_{tname}''' 92 93 try: 94 cursor = db.cursor() 95 cursor.execute(f'SELECT name FROM v_{tname} WHERE profile MATCH ?', (what,)) 96 for row in cursor: 97 print(f' {row[0]}') 98 except Exception as ex: 99 raise ex 100 101 102 def create_table(db, tname): 103 '''create a table with id/name/profile record 104 105 note: create even a virtual table named v_{tname}''' 106 107 try: 108 cursor = db.cursor() 109 cursor.execute(f'CREATE TABLE IF NOT EXISTS {tname}(\ 110 id INTEGER PRIMARY KEY, \ 111 name TEXT, \ 112 profile TEXT)') 113 cursor.execute(f'CREATE VIRTUAL TABLE v_{tname} USING fts3(name, profile)') 114 db.commit() 115 except Exception as ex: 116 db.rollback() 117 raise ex 118 119 120 def drop_table(db, tname): 121 '''drop A SINGLE table 122 123 note: tname and v_{tname} must be dropped separately 124 this is because in case of partial drop (one of two tables) 125 program could become stuck''' 126 127 try: 128 cursor = db.cursor() 129 cursor.execute(f'DROP TABLE {tname}') 130 db.commit() 131 except Exception as ex: 132 db.rollback() 133 raise ex 134 135 def help(): 136 '''list commands''' 137 138 for cmd in CMDS: 139 print(f' {cmd}: {DoCMDS[cmd]}') 140 print() 141 142 def cmd_exe(db, cmd): 143 ''' execute command, BUT not 'quit' ''' 144 145 global TABLE 146 try: 147 #pdb.set_trace() 148 if not cmd in CMDS: 149 raise ValueError(f'ERROR: cmd {cmd} unknown') 150 elif cmd == 'c': # create table 151 tname = input(f'table name to create? (^C to stop) ') 152 print(f' creating table(s) {tname} and v_{tname}') 153 create_table(db, tname) 154 TABLE = tname[:] 155 elif cmd == 'd': # drop table 156 tname = input(f'table name to drop? (^C to stop) ') 157 print(f' deleting table {tname}') 158 drop_table(db, tname) 159 elif cmd == 'fts': # full text search 160 what = input(f'full text search of what? (^C to stop) ') 161 print(f' searching {what} in v_{TABLE}') 162 fts(db, TABLE, what) 163 elif cmd == 'i': # insert persons 164 print(f' inserting persons') 165 insert(db, TABLE, PERSONS) 166 elif cmd == 'l': # list persons 167 print(f' listing persons') 168 list(db, TABLE) 169 else: 170 raise ValueError(f'WARNING: cmd {cmd} unsupported ') 171 except Exception as ex: 172 raise ex 173 174 def get_table(db): 175 '''return 1st table name not starting with v_ prefix''' 176 177 try: 178 cursor = db.cursor() 179 cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'") 180 tables = [] 181 for row in cursor: 182 if row[0].split('_')[0]=='v': 183 continue 184 else: 185 return row[0] 186 except Exception as ex: 187 raise ex 188 189 190 def main(): 191 global TABLE 192 # Creates or opens a SQLite3 DB 193 with sqlite3.connect( DB ) as db: 194 TABLE = get_table(db) 195 print(f'{ DB } bd opened') 196 if TABLE: 197 print(f'{TABLE} table exists') 198 199 #command loop 200 while True: 201 try: 202 cmd = input(f'{ CMDS } cmd? ') 203 if not cmd in CMDS: 204 print(f'cmd { cmd } unknown; h for help') 205 continue 206 elif cmd == 'q': # quit 207 print(f'quitting') 208 break 209 else: 210 cmd_exe(db, cmd) 211 except Exception as ex: 212 raise 213 214 215 if __name__=='__main__': 216 main()
Questo codice è scaricabile da github.
Che altro c'e da fare per poter utilizzare la ricerca a testo libero con Python e SQLite 3? A patto di usare una versione aggiornata di Python, non c'è bisogno d'altro: tutto incluso nelle batterie di Python! :-)
Attenzione al fatto che le tabelle virtuali vanno alimentate con testo puro. Questo significa che se utilizziamo un linguaggio di markup, come normalmente é, dobbiamo eliminarlo prima di passare il contenuto del record alla tabella virtuale. Se usiamo direttamente html, è possibile affidare l'incombenza alla libreria Python beautifulsoup4:
from BeautifulSoup import BeautifulSoup html = '<div> <p>my content</p></div>' text = ''.join(BeautifulSoup(html).findAll(text=True))
Se proveniamo da un altro linguaggio di markup, verifichiamo se é possibile trasformarlo in html e poi usare beautifulsoup4 [6]:
from BeautifulSoup import BeautifulSoup from markdown import markdown html = markdown(some_markdown_string) text = ''.join(BeautifulSoup(html).findAll(text=True))
Dimenticavo, e la documentazione? Qui troviamo quella relativa a FTS3 e FTS4, mentre qui troviamo quella relativa a FTS5.
E poi?. E poi, con questo articolo ho sfiorato la superficie del problema. Limitandomi ad osservare come creare, in modo semplice, la tabella virtuale di supporto alla ricerca, e come inserire, da una sola sorgente, i record da sottoporre a ricerca. Ma è necessario approfondire vari aspetti. Ad esempio: la commit come si comporta? Come fare modifiche o cancellazioni di record? Che succede in caso di richieste concorrenti? ...
Un ambiente di cui vale la pena approfondire gli aspetti. Tra l'altro anche perché permette di avere dati e indici racchiusi in un'unico file, oltre al fatto di essere una libreria che richiede poche risorse.
Lo svantaggio evidente è il fatto che lega l'applicazione al DBMS: SQLite 3. Cosa solitamente non gradita a chi sviluppa in Django. Uno dei punti di forza di questo ambiente è il fatto che, usando con un minimo di attenzione il suo linguaggio di modellazione dei dati, l'applicazione sviluppata può essere facilmente installata utilizzando DBMS diversi: SQLite piuttosto che PostgreSQL piuttosto che MySQL (ed altri).
Enjoy by ldfa
[1] | Data Base Management System. |
[2] | Full Text Search. |
[3] | Ho letto che versioni prededenti di SQLite 3 su MS Windows non supportavano FTS3. Era necessario compilare SQLite 3 ad hoc per avere questa feature in Windows. |
[4] | Nel senso che se la ns tabella si chiama persone, la tabella virtuale si chiamerà v_persone (più altre tabelle di appoggio gestite da SQLite 3, che noi non abbiamo necessità di nominare esplicitamente). |
[5] | All'apertura del DB, se SQLite 3 non lo trova, ne crea una istanza vuota. |
[6] | Come suggerito in stackoverflow. |