Let's explain Wordify with an example. Imagine you are thinking about having a glass of wine with your friends; you know you like bold, woody wine, but are unsure which one to choose. You wonder whether there are some words that describe each type of wine; since you are a researcher, you decide to approach the problem scientifically.
You go to your favorite platform and, for each type of wine (label), you download some reviews (texts).
You use our platform to wordify your data.
You receive the results of the analysis by email; they list the most indicative words, both negative and positive, for each type of wine.
Step 1. Prepare your data
Create an Excel file with two columns. Name your columns "text" and "label". Copy each of your texts into its own row in the first column, and add the respective label in the second column.
For reliable results, we suggest providing at least 2000 labelled texts. If you provide fewer, we will still wordify your file, but the results should be taken with a grain of salt.
We currently do not support multi-language texts, so all your texts should be in one language. Check out the FAQ to see which languages are supported.
Step 2. Upload your file
Once you have prepared your Excel file, click the "Choose File" button. Browse for your file.
Choose the language of your texts. Check out the FAQ to see which languages are supported. Provide your email. We will process your data and you will receive your wordified file by email. Depending on the number of requests, it can take up to 30 minutes (but usually 3-4 minutes are enough). No data is stored on our server.
Wordify your data!
A way to find out which terms are most indicative for each of your dependent variable values.
Nothing. We never store the data you upload on disk: it is only kept in memory for the duration of the modeling, and then deleted. We do not retain any copies or traces of your data.
The file you upload should be .xlsx, with two columns: the first should be labeled 'text' and contain all your documents (e.g., tweets, reviews, patents, etc.), one per line. The second column should be labeled 'label', and contain the dependent variable label associated with each text (e.g., rating, author gender, company, etc.).
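The layout described above can be checked before uploading. The following is a minimal sketch, not part of Wordify: the function name check_table and its messages are hypothetical, and it works on a header and rows that have already been read into memory (reading the .xlsx itself is left out; a spreadsheet library such as openpyxl could supply them). The 2000-row figure is the recommendation given in this FAQ, used here only as a warning threshold.

```python
def check_table(header, rows, min_rows=2000):
    """Return a list of problems found; an empty list means the table looks fine.

    header: the cells of the first spreadsheet row.
    rows:   the remaining rows, each a (text, label) pair.
    """
    problems = []
    # The two columns must be named exactly 'text' and 'label', in that order.
    if list(header) != ["text", "label"]:
        problems.append(f"expected columns ['text', 'label'], got {list(header)}")
    # Every row needs one non-empty text and one non-empty label.
    for number, row in enumerate(rows, start=2):  # spreadsheet row 1 is the header
        if len(row) != 2 or any(c is None or not str(c).strip() for c in row):
            problems.append(f"row {number}: needs a non-empty text and label")
    # Fewer texts are still accepted, so this is a warning rather than an error.
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} texts; at least {min_rows} are recommended")
    return problems


# Hypothetical usage with two toy reviews.
for problem in check_table(["text", "label"],
                           [("bold and woody", "red"), ("crisp and light", "white")]):
    print(problem)
```

An empty result means the header and every row match the expected layout and the recommended minimum number of texts is met.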
It uses a variant of Stability Selection (Meinshausen and Bühlmann, 2010) to fit hundreds of logistic regression models on random subsets of the data, using different L1 penalties to drive as many of the term coefficients as possible to 0. Any term that receives a non-zero coefficient in at least 30% of all model runs can be seen as a stable indicator.
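The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Wordify's actual code: the function names (fit_l1_logistic, stable_terms), the whitespace tokenizer, the half-size random subsets, the penalty values, and the small default number of runs are all hypothetical choices, and the solver is a plain proximal gradient loop chosen to keep the example dependency-free. It handles one label at a time (that label against all others).

```python
import math
import random


def fit_l1_logistic(docs, y, n_terms, penalty, n_iter=100, lr=0.5):
    """L1-penalised logistic regression fitted by proximal gradient descent.

    docs: list of sets of term indices (binary bag of words); y: list of 0/1.
    Returns one weight per term; the bias is fitted but not penalised.
    """
    w, b, n = [0.0] * n_terms, 0.0, len(docs)
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * n_terms, 0.0
        for doc, target in zip(docs, y):
            z = b + sum(w[j] for j in doc)
            err = 1.0 / (1.0 + math.exp(-z)) - target  # predicted prob. minus truth
            grad_b += err / n
            for j in doc:
                grad_w[j] += err / n
        b -= lr * grad_b
        for j in range(n_terms):
            step = w[j] - lr * grad_w[j]
            # Soft-thresholding: the L1 step that sets coefficients to exactly 0.
            w[j] = math.copysign(max(abs(step) - lr * penalty, 0.0), step)
    return w


def stable_terms(texts, labels, target, n_runs=50,
                 penalties=(0.02, 0.05, 0.1), threshold=0.3, seed=0):
    """Map each stable term to (selection frequency, mean non-zero coefficient)."""
    rng = random.Random(seed)
    vocab = sorted({t for text in texts for t in text.lower().split()})
    index = {t: j for j, t in enumerate(vocab)}
    docs = [{index[t] for t in text.lower().split()} for text in texts]
    y = [1 if label == target else 0 for label in labels]
    hits, coef_sum = [0] * len(vocab), [0.0] * len(vocab)
    for _ in range(n_runs):
        # Each run uses a random half of the data and a randomly chosen penalty.
        rows = rng.sample(range(len(docs)), len(docs) // 2)
        w = fit_l1_logistic([docs[i] for i in rows], [y[i] for i in rows],
                            len(vocab), rng.choice(penalties))
        for j, wj in enumerate(w):
            if wj != 0.0:
                hits[j] += 1
                coef_sum[j] += wj
    # Keep terms with a non-zero coefficient in at least `threshold` of the runs.
    return {vocab[j]: (hits[j] / n_runs, coef_sum[j] / hits[j])
            for j in range(len(vocab))
            if hits[j] and hits[j] / n_runs >= threshold}


# Hypothetical toy data: made-up reviews for two wine types.
texts = (["bold oak tannin", "oak smoke finish", "deep oak cherry", "oak spice wine"] * 5
         + ["crisp citrus wine", "citrus apple finish", "light citrus peach",
            "citrus mineral zest"] * 5)
labels = ["red"] * 20 + ["white"] * 20
for term, (freq, coef) in sorted(stable_terms(texts, labels, target="red").items()):
    print(f"{term}: selected in {freq:.0%} of runs, mean coefficient {coef:+.2f}")
```

In this sketch, a positive mean coefficient marks a term as a positive indicator of the target label and a negative one as a negative indicator. With real data, the number of runs would be in the hundreds, as described above.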
We recommend at least 2000 instances; the more, the better. With fewer instances, the results are less replicable and less reliable.
Yes please! Reference coming soon...
Currently we support: English, German, Dutch, Spanish, French, Portuguese, Italian, Greek.
Via Röntgen n. 1, Milan 20136 (ITALY)
+39 02 5836 2604