Retrieving Datasets¶
There are several ways to retrieve experiment data from the database:
Retrieving using download_datafiles
¶
The simplest way to retrieve data is using the download_datafiles command. This creates three csv files containing the three kinds of data: trial data, question data, and event data.
Retrieving programmatically¶
While the download_datafiles
shell command is the simplest way to retrieve
experiment data, a more powerful and flexible solution is to retrieve the data
programmatically. Many languages offer libraries for interfacing with mysql and
sqlite databases - below is an example using python and the sqlalchemy package
to retrieve data from a mysql database. We add +pymysql to the db_url to let
sqlalchemy make use of pymysql package. (You can leave the database_url in config.txt
as mysql:// though – psiturk adds +pymysql internally). By including code such as this at the
beginning of your analysis script, you can be sure the the data you’re analyzing is
always complete and up-to-date.
from sqlalchemy import create_engine, MetaData, Table
import json
import pandas as pd
db_url = "mysql+pymysql://username:password@host.org/database_name"
table_name = 'my_experiment_table'
data_column_name = 'datastring'
# boilerplace sqlalchemy setup
engine = create_engine(db_url)
metadata = MetaData()
metadata.bind = engine
table = Table(table_name, metadata, autoload=True)
# make a query and loop through
s = table.select()
rows = s.execute()
data = []
#status codes of subjects who completed experiment
statuses = [3,4,5,7]
# if you have workers you wish to exclude, add them here
exclude = []
for row in rows:
# only use subjects who completed experiment and aren't excluded
if row['status'] in statuses and row['uniqueid'] not in exclude:
data.append(row[data_column_name])
# Now we have all participant datastrings in a list.
# Let's make it a bit easier to work with:
# parse each participant's datastring as json object
# and take the 'data' sub-object
data = [json.loads(part)['data'] for part in data]
# insert uniqueid field into trialdata in case it wasn't added
# in experiment:
for part in data:
for record in part:
record['trialdata']['uniqueid'] = record['uniqueid']
# flatten nested list so we just have a list of the trialdata recorded
# each time psiturk.recordTrialData(trialdata) was called.
data = [record['trialdata'] for part in data for record in part]
# Put all subjects' trial data into a dataframe object from the
# 'pandas' python library: one option among many for analysis
data_frame = pd.DataFrame(data)
How the datastring is structured¶
The main data from an experiment participant is held in a string of text in the datastring field of the data table. Understanding how this string is structured is important to be able to parse the string into a useful format for your analyses.
The datastring is structured as a json object. In the description that follows, sub-objects are indicated by names wrapped in angle brackets (< >).
Top Level¶
The top level of the datastring contains summary information about the worker, as well as the datastring sub-objects:
{"condition": condition,
"counterbalance": counterbalance,
"assignmentId": assignmentId,
"workerId": workerId,
"hitId": hitId,
"currenttrial": trial_number_when_data_was_saved,
"useragent": useragent,
"data": <data>,
"questiondata": <questiondata>,
"eventdata": <eventdata>,
"mode": <mode>}
data¶
The data sub-object contains a list of the data recorded each time psiturk.recordTrialData() is called in the experiment:
[{"uniqueid": uniqueid,
"current_trial": current_trial_based_on_#_of_calls_to_recordTrialData,
"dataTime": current_time_in_system_time,
"trialdata": <datalist>},
...
]
Here, <datalist>
is whatever is passed to psiturk.recordTrialData()
in the
experiment. This could be in any format, such as a string or list, but we
recommend saving data in a json format so that all data is stored in a clear,
easy-to-parse “field-value” format.
questiondata¶
The questiondata sub-object contains all items recorded using psiturk.recordUnstructuredlData().
{"field1": value1,
"field2": value2,
...
}
eventdata¶
The eventdata sub-object contains a list of events (such as window resizing) that occurred during the experiments:
[{"eventtype": eventtype,
"value": value,
"timestamp": current_time_in_system_time,
"interval": interval},
...
]