71

I have a pandas DataFrame called 'df'.

How can I do something like the following?

df.query("select * from df")

Thank you.

For those who know R: there is a library called sqldf that lets you execute SQL code in R. My question is, basically, is there some library like sqldf in Python?

user1717828
Miguel Santos

9 Answers

116

This is not what pandas.query is supposed to do. You can look at the package pandasql (similar to sqldf in R):

import pandas as pd
import numpy as np
import pandasql as ps

df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan],
                   [1234, 'Customer A', np.nan, '333 Street'],
                   [1233, 'Customer B', '444 Street', '333 Street'],
                   [1233, 'Customer B', '444 Street', '666 Street']],
                  columns=['ID', 'Customer', 'Billing Address', 'Shipping Address'])

q1 = """SELECT ID FROM df """

print(ps.sqldf(q1, locals()))

     ID
0  1234
1  1234
2  1233
3  1233

Update 2020-07-10

With an updated pandasql you can pass the query directly:

ps.sqldf("select * from df")
AdamAL
BENY
  • I get this error due to numpy not being imported: Traceback (most recent call last): File "", line 1, in NameError: name 'np' is not defined – Jas Aug 15 '19 at 20:21
  • @Jas that was just the sample data setup; if you change the np.nan to 1000, it will be gone – BENY Aug 15 '19 at 21:09
  • FYI, it doesn't look like this works anymore. I get the error `AttributeError: 'Connection' object has no attribute 'cursor'`. It might work on older versions of `pandas`; I'm using v1.3.4. – Matt Sosna Dec 01 '21 at 21:38
  • @MattSosna it still works for me ~ – BENY Dec 01 '21 at 21:45
22

After using this for some time, I realised the easiest way is to just do:

from pandasql import sqldf

output = sqldf("select * from df")

It works like a charm, where df is a pandas DataFrame. You can install pandasql from PyPI: https://pypi.org/project/pandasql/
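For a self-contained run, here is a minimal sketch; the toy frame below is only an illustration, not part of the original answer:

import pandas as pd
from pandasql import sqldf

# a toy frame standing in for your own df
df = pd.DataFrame({'ID': [1, 2, 3], 'value': [10, 20, 30]})

# sqldf resolves the table name "df" against variables in the calling scope
output = sqldf("select * from df where value > 15")
print(output)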

Miguel Santos
  • Nice, I was still on the old pandasql version; I have also updated my answer, thanks ~ – BENY Jul 10 '20 at 14:28
4

You can use DataFrame.query(condition) to return the subset of the DataFrame matching condition, like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('ABC'))
df
   A  B  C
0  0  1  2
1  3  4  5
2  6  7  8

df.query('C < 6')
   A  B  C
0  0  1  2
1  3  4  5


df.query('2*B <= C')
   A  B  C
0  0  1  2


df.query('A % 2 == 0')
   A  B  C
0  0  1  2
2  6  7  8

This has basically the same effect as an SQL statement, except that the SELECT * FROM df WHERE part is implied.

user1717828
  • You can add. `df.eval` https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.eval.html – BENY Aug 24 '17 at 15:55
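For completeness, a minimal sketch of the df.eval suggestion from the comment above, reusing the example frame from this answer:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('ABC'))

# eval computes column expressions; assignment syntax returns a copy with a new column
df.eval('D = A + B')

# a boolean eval expression can reproduce query(): this matches df.query('C < 6')
df[df.eval('C < 6')]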
4

There's actually a new package that I just released, called dataframe_sql. This gives you the ability to query pandas DataFrames using SQL just as you want to. You can find the package here.
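A minimal sketch of how dataframe_sql is used, written from memory of the project's README; the register_temp_table and query names are my recollection and may differ between versions:

import pandas as pd
from dataframe_sql import register_temp_table, query  # names assumed from the README

df = pd.DataFrame({'ID': [1, 2, 3], 'value': [10, 20, 30]})

# register the frame under a table name, then run SQL against it
register_temp_table(df, "my_table")
print(query("SELECT * FROM my_table WHERE value > 15"))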

Zach Brookler
  • I was not able to find a conda distribution for this, and our support team can only use conda tools to install packages, not pip. Do you have any plans to release one? – Sajal May 26 '21 at 18:50
  • @Sajal I can certainly look into it, would you mind raising this as an issue on GitHub? – Zach Brookler May 31 '21 at 16:55
2

A much better solution is to use duckdb:

pip install duckdb

import pandas as pd
import duckdb

test_df = pd.DataFrame.from_dict({"i": [1, 2, 3, 4], "j": ["one", "two", "three", "four"]})

duckdb.query("SELECT * FROM test_df where i>2").df()  # returns a result dataframe

Performance improvement over pandasql (test data: NYC yellow cab trips, ~120 MB of CSV data):

nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s
duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms

That is a speed improvement of roughly 100x.

This article gives good details and claims a 1000x improvement over pandasql: https://duckdb.org/2021/05/14/sql-on-pandas.html

Leo Liu
1

Or, you can use the tools that do what they do best:

  1. Install PostgreSQL

  2. Connect to the database:

from sqlalchemy import create_engine
import urllib.parse

# placeholder connection details: fill in your own, URL-encoding the user and password
dialect, host, port, dbname = "postgresql", "localhost", 5432, "mydb"
user_uenc, pw_uenc = urllib.parse.quote_plus("myuser"), urllib.parse.quote_plus("mypassword")

engconnect = "{0}://{1}:{2}@{3}:{4}/{5}".format(dialect, user_uenc, pw_uenc, host, port, dbname)
dbengine = create_engine(engconnect)
database = dbengine.connect()

  3. Dump the DataFrame into Postgres:

df.to_sql('mytablename', database, if_exists='replace')

  4. Write your query with all the SQL nesting your brain can handle:

myquery = "select distinct * from mytablename"

  5. Create a DataFrame by running the query:

newdf = pd.read_sql(myquery, database)
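The same to_sql / read_sql pattern also works without a server, against an in-memory SQLite engine; a lighter-weight sketch (not from the original answer, assuming SQLite covers the SQL you need):

import pandas as pd
from sqlalchemy import create_engine

# in-memory SQLite engine instead of a PostgreSQL server
engine = create_engine("sqlite://")

df = pd.DataFrame({'ID': [1, 1, 2], 'value': [10, 10, 30]})
df.to_sql('mytablename', engine, if_exists='replace', index=False)

newdf = pd.read_sql("select distinct * from mytablename", engine)
print(newdf)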

alphacrash
1

I think a better solution than pandasql would be duckdb. The way it handles the table name mapping to a DataFrame object is a little cleaner, IMO. I have not evaluated performance, though.
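A minimal sketch of that mapping: duckdb resolves the table name in the SQL text against a DataFrame variable that is in scope.

import duckdb
import pandas as pd

df = pd.DataFrame({'i': [1, 2, 3], 'j': ['one', 'two', 'three']})

# the name "df" in the query is resolved to the local DataFrame variable
result = duckdb.query("SELECT * FROM df WHERE i > 1").df()
print(result)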

0

There is also FugueSQL

pip install fugue[sql]

import pandas as pd
from fugue_sql import fsql

comics_df = pd.DataFrame({'book': ['Secret Wars 8',
                                   'Tomb of Dracula 10',
                                   'Amazing Spider-Man 252',
                                   'New Mutants 98',
                                   'Eternals 1',
                                   'Amazing Spider-Man 300',
                                   'Department of Truth 1'],
                          'publisher': ['Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Image'],
                          'grade': [9.6, 5.0, 7.5, 8.0, 9.2, 6.5, 9.8],
                          'value': [400, 2500, 300, 600, 400, 750, 175]})

# which of my books are graded above 8.0?
query = """
SELECT book, publisher, grade, value FROM comics_df
WHERE grade > 8.0
PRINT
"""

fsql(query).run()

Output

PandasDataFrame
book:str                                                      |publisher:str|grade:double|value:long
--------------------------------------------------------------+-------------+------------+----------
Secret Wars 8                                                 |Marvel       |9.6         |400       
Eternals 1                                                    |Marvel       |9.2         |400       
Department of Truth 1                                         |Image        |9.8         |175       
Total count: 3

References

https://fugue-tutorials.readthedocs.io/tutorials/beginner/beginner_sql.html

https://www.kdnuggets.com/2021/10/query-pandas-dataframes-sql.html

SchemeSonic
0

Another solution is RBQL, which provides an SQL-like query language that allows using Python expressions inside SELECT and WHERE statements. It also provides a convenient %rbql magic command for use in Jupyter/IPython:

# Get some test data:
!pip install vega_datasets
from vega_datasets import data
my_cars_df = data.cars()
# Install and use RBQL:
!pip install rbql
%load_ext rbql
%rbql SELECT * FROM my_cars_df WHERE a.Horsepower > 100 ORDER BY a.Weight_in_lbs DESC

In this example, my_cars_df is a pandas DataFrame.

You can try it in this demo Google Colab notebook.

mechatroner