I am a longtime user and fan of Duolingo, a platform for learning new languages free of charge. And as a PhD student researching language learning and bilingualism, I have long awaited the opportunity to get my hands on any of the data they accumulate. Well, I finally have. Burr Settles, of the Duolingo team, published a paper in the ACL proceedings discussing spaced repetition in Duolingo data. With this publication they shared their code and data on GitHub. So I’m going to play with it a bit! Here’s what I want to know:
- Which languages are hardest to learn?
- The Foreign Service Institute has ranked languages based on how long they should take to learn. If this holds for our data, German should be a little bit harder than our other languages (Spanish, French, Italian and Portuguese).
- Which kinds of words are hardest to learn?
- I’ve always found verbs to be the most challenging because they, at least in the languages listed here, involve more marking than other kinds of words (i.e., you have to worry about more endings, and a wider variety of them).
- Can we build a model to test the differences we see in the data?
Spoil your appetite and skip to the answers.
Clean the data
So before doing anything else let’s load some relevant libraries and take a peek at the data.
library(data.table)
library(ggplot2)
library(Rmisc)
library(stringr)
library(stringdist)
library(lme4)
library(SnowballC)
#Data can be found here: https://github.com/duolingo/halflife-regression
#fread is a faster means of loading a big dataset, which we know this will be
data.raw = fread('bigset.csv')
## Read 12854226 rows and 12 (of 12) columns from 1.219 GB file in 00:00:17
And let’s take a brief look at the data…
##    p_recall  timestamp    delta user_id learning_language ui_language
## 1:      1.0 1362076081 27649635    u:FO                de          en
## 2:      0.5 1362076081 27649635    u:FO                de          en
## 3:      1.0 1362076081 27649635    u:FO                de          en
## 4:      0.5 1362076081 27649635    u:FO                de          en
## 5:      1.0 1362076081 27649635    u:FO                de          en
## 6:      1.0 1362076081 27649635    u:FO                de          en
##                           lexeme_id                    lexeme_string
## 1: 76390c1350a8dac31186187e2fe1e178 lernt/lernen<vblex><pri><p3><sg>
## 2: 7dfd7086f3671685e2cf1c1da72796d7    die/die<det><def><f><sg><nom>
## 3: 35a54c25a2cda8127343f6a82e6f6b7d         mann/mann<n><m><sg><nom>
## 4: 0cf63ffe3dda158bc3dbd55682b355ae         frau/frau<n><f><sg><nom>
## 5: 84920990d78044db53c1b012f5bf9ab5   das/das<det><def><nt><sg><nom>
## 6: 56429751fdaedb6e491f4795c770f5a4    der/der<det><def><m><sg><nom>
##    history_seen history_correct session_seen session_correct
## 1:            6               4            2               2
## 2:            4               4            2               1
## 3:            5               4            1               1
## 4:            6               5            2               1
## 5:            4               4            1               1
## 6:            4               3            1               1
## Classes 'data.table' and 'data.frame': 12854226 obs. of 12 variables:
##  $ p_recall         : num 1 0.5 1 0.5 1 1 1 1 1 0.75 ...
##  $ timestamp        : int 1362076081 1362076081 1362076081 1362076081 1362076081 1362076081 1362076081 1362082032 1362082044 1362082044 ...
##  $ delta            : int 27649635 27649635 27649635 27649635 27649635 27649635 27649635 444407 5963 5963 ...
##  $ user_id          : chr "u:FO" "u:FO" "u:FO" "u:FO" ...
##  $ learning_language: chr "de" "de" "de" "de" ...
##  $ ui_language      : chr "en" "en" "en" "en" ...
##  $ lexeme_id        : chr "76390c1350a8dac31186187e2fe1e178" "7dfd7086f3671685e2cf1c1da72796d7" "35a54c25a2cda8127343f6a82e6f6b7d" "0cf63ffe3dda158bc3dbd55682b355ae" ...
##  $ lexeme_string    : chr "lernt/lernen<vblex><pri><p3><sg>" "die/die<det><def><f><sg><nom>" "mann/mann<n><m><sg><nom>" "frau/frau<n><f><sg><nom>" ...
##  $ history_seen     : int 6 4 5 6 4 4 4 3 8 6 ...
##  $ history_correct  : int 4 4 4 5 4 3 4 3 6 5 ...
##  $ session_seen     : int 2 2 1 2 1 1 1 1 6 4 ...
##  $ session_correct  : int 2 1 1 1 1 1 1 1 6 3 ...
##  - attr(*, ".internal.selfref")=<externalptr>
Ok, wow. We have almost 13 million datapoints from over 115 thousand users learning 6 languages, as well as information about every word learned. Let’s look more at what’s in this dataset.
Each line of the dataset is a word for a given user, for a given session. So the first line is the word “lernt” (seen in the lexeme_string) for user u:FO for some session of German. In this particular session they’ve seen the word twice (session_seen) and gotten it right twice (session_correct). Before this session they’d seen the word 6 times (history_seen) and gotten it right 4 times (history_correct).
The lexeme_string variable has a lot of juicy information. First we see the surface form, which is the word in question, as it appears. After the /, we see the lemma, which is the base form of the word (unchanged to note person or tense or anything like that). Then in the first set of <>, we have the part of speech, and following that we have a lot of information about how the word is modified - so lernt is a lexical verb, in the present tense, third person, singular (in that order).
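To make that structure concrete, here is a minimal base R sketch that splits a lexeme_string into its pieces. The helper name parse_lexeme is my own invention for illustration and is not part of the Duolingo code; it assumes every string follows the surface/lemma<tag><tag>... pattern described above.

```r
# parse_lexeme is a hypothetical helper (not from the original analysis).
# It assumes strings of the form surface/lemma<tag1><tag2>...
parse_lexeme <- function(x) {
  surface <- sub("/.*$", "", x)        # everything before the first "/"
  rest    <- sub("^[^/]*/", "", x)     # lemma plus tags
  lemma   <- sub("<.*$", "", rest)     # text before the first "<"
  # Pull out the text inside each <...> pair
  tags <- regmatches(rest, gregexpr("(?<=<)[^>]+(?=>)", rest, perl = TRUE))
  data.frame(
    surface   = surface,
    lemma     = lemma,
    pos       = vapply(tags, function(t) t[1], character(1)),  # first tag
    modifiers = vapply(tags, function(t) paste(t[-1], collapse = ","), character(1)),
    stringsAsFactors = FALSE
  )
}

examples <- c("lernt/lernen<vblex><pri><p3><sg>",
              "die/die<det><def><f><sg><nom>")
parsed <- parse_lexeme(examples)
print(parsed)
```

The first tag is treated as the part of speech and the rest as modifiers, which matches the ordering in the two example strings taken from the data preview.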
Because this data set is massive and R isn’t super for such datasets, we need to clean up the data and add some variables to make it as memory and time efficient as possible.
#Removing all lines with NA data and non-English learners for simplicity
data.raw = data.raw[complete.cases(data.raw),]
data.raw = data.raw[data.raw$ui_language == "en"]
#Getting rid of variables that mean nothing to us
data.raw$timestamp = NULL
data.raw$lexeme_id = NULL
So we’ll be focusing on English-speaking users learning German (de), Spanish (es), French (fr), Italian (it) and Portuguese (pt).
First, I want to find the total number of times a person has seen a word or gotten a word correct, across sessions. To do this I need to find the highest value of history_seen for each word and each person, and remove all rows that aren’t the max for that person and word. This will show us each person’s last session for a given word. We will then add that session’s counts to the history counts to calculate a total for each word.
#Create a temporary factor that includes both a user_id and a lexeme_string
data.raw$temp = as.factor(paste(data.raw$user_id, data.raw$lexeme_string, sep = "_"))
#Remove all rows that aren't the max value
data.reduced = data.raw[data.raw[, .I[history_seen == max(history_seen)], by=temp]$V1]
#Create total_variables
data.reduced$total_seen = data.reduced$history_seen + data.reduced$session_seen
data.reduced$total_correct = data.reduced$history_correct + data.reduced$session_correct
#This one was especially fun to name
data.reduced$total_recall = data.reduced$total_correct/data.reduced$total_seen
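If that filter is hard to picture, here is a tiny made-up example of the same idea in base R. The four rows are invented for illustration (they are not from the Duolingo data), and I use ave() instead of data.table so the sketch runs without extra packages.

```r
# Invented toy rows: user u:A practices "frau" in three sessions, u:B in one.
toy <- data.frame(
  user_id         = c("u:A", "u:A", "u:A", "u:B"),
  lexeme_string   = rep("frau/frau<n><f><sg><nom>", 4),
  history_seen    = c(2, 5, 9, 4),
  history_correct = c(2, 4, 7, 3),
  session_seen    = c(3, 4, 2, 1),
  session_correct = c(2, 3, 2, 1),
  stringsAsFactors = FALSE
)

# One key per user/word pair, as in the temp variable above
key <- paste(toy$user_id, toy$lexeme_string, sep = "_")

# Keep only the row with the largest history_seen within each pair
latest <- toy[toy$history_seen == ave(toy$history_seen, key, FUN = max), ]

# Lifetime totals = history before the last session + the last session itself
latest$total_seen    <- latest$history_seen + latest$session_seen
latest$total_correct <- latest$history_correct + latest$session_correct
latest$total_recall  <- latest$total_correct / latest$total_seen
print(latest)
```

For u:A only the third row survives, giving 9 + 2 = 11 total views and 7 + 2 = 9 correct answers. Note that if two rows tie for the maximum history_seen, both are kept, and the same is true of the data.table version.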
Next we’ll aggregate over subjects. By averaging every subject’s response to a lexeme_string, we significantly reduce the size of the dataset, having only an average value for each lexeme string produced. We lose some variability due to averaging, but the dataset is so big that shouldn’t be a problem.
data.reduced = data.frame(aggregate(cbind(total_seen, total_correct, total_recall) ~ lexeme_string + learning_language, data = data.reduced, mean))
#make learning_language a factor
data.reduced$learning_language = as.factor(data.reduced$learning_language)
#Peek
hist(data.reduced$total_recall)
So this is definitely pretty skewed, but nothing strange considering Duolingo is designed to get people to high levels of recall over time.
Add a lemma variable
That aggregation makes our data MUCH more manageable, reducing our dataset to about 1% of the original size. Now we’ll add a lemma column. The lexeme_string contains the information about the word that we want to extract. While the first part, the surface form, isn’t that important to us, the “lemma” is. It’s the base word that we’ll be working with. The lemma sits in the lexeme string after the / and before the part of speech, which starts with <. So we’ll simply tell R to check every lexeme string and extract the characters after the / and before the first <.
#This removes all information before the lexeme
data.reduced$lexeme_string = gsub("^.*?/","/",data.reduced$lexeme_string)
data.reduced$lemma = substr(data.reduced$lexeme_string, 2, as.numeric(lapply(gregexpr('<',data.reduced$lexeme_string),head,1)) - 1)
Add item and cognate status variables
Now I want to add a column that contains all of the information in the same language. For example, it’s better for analysis if I can represent the word chien as a vector containing the semantic information and the language - something like
#CSV containing all of our translations.
trans = read.csv('translations.csv', encoding = "UTF-8")
#Add a column combining learning_language and lemma, so that we can match the two documents together
data.reduced$ll_lemma = paste(data.reduced$learning_language, data.reduced$lemma, sep = "_")
trans$ll_lemma = paste(trans$learning_language, trans$lemma, sep = "_")
#We'll add the actual item column in a minute
Additionally, I think cognate status could impact learning. A cognate is a word that is the same, or very similar, between two languages (like animal in Spanish and English). We know that cognate status improves word learning, so I wanted to implement a simple measure of it, using Levenshtein distance.
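#Cognate status
## stringsim returns a similarity between 0 and 1, based on the minimal number of edits (deletions, insertions, and substitutions) that can change one word into another.
## Note: with no method argument, stringsim uses optimal string alignment ("osa"), which also counts swaps of adjacent letters; method = "lv" gives plain Levenshtein.
trans$cognatestatus = stringsim(as.character(trans$item),as.character(trans$lemma))
data.reduced$cognatestatus = with(trans, cognatestatus[match(data.reduced$ll_lemma,ll_lemma)])
#Peek at cognatestatus
hist(trans$cognatestatus)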
Cognatestatus looks pretty good. Many words have next to no letter overlap (low values), but others have a pretty even distribution.
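For a feel of what these similarity values mean, here is a small sketch using base R’s adist(), which computes plain Levenshtein distance. The helper lev_sim and the three word pairs are my own illustrative assumptions; the analysis itself relies on stringdist::stringsim, whose values may differ slightly depending on the method used.

```r
# lev_sim is a hypothetical helper: 1 - (edit distance / length of longer word).
lev_sim <- function(a, b) {
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# Invented examples: English item vs. foreign lemma
pairs <- data.frame(
  item  = c("animal", "dog",   "family"),
  lemma = c("animal", "chien", "familia"),
  stringsAsFactors = FALSE
)
pairs$cognatestatus <- lev_sim(pairs$item, pairs$lemma)
print(pairs)
```

Identical words score 1, words with no letters lining up score 0, and near-cognates such as family/familia land in between (two edits over seven letters, so about 0.71).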
#Here I trim the endings off the translations to make sure plural words are not considered different than singular words.
##This is the simplest way to do that without manually editing all translations
trans$item = wordStem(trans$item, language = "english")
#Add item category
data.reduced$item = with(trans, item[match(data.reduced$ll_lemma,ll_lemma)])
data.reduced$ll_lemma = NULL
Add a part of speech variable
Now let’s add a couple simple variables that might help us capture differences in data. I’d like to extract the part of speech of each word, located in the lexeme_string, found in the first set of <>.
data.reduced$pos = substr(data.reduced$lexeme_string,
  as.numeric(lapply(gregexpr('<',data.reduced$lexeme_string),head,1)) + 1,
  as.numeric(lapply(gregexpr('>',data.reduced$lexeme_string),head,1)) - 1)
Now I’d like to simplify the part of speech variable. It would be too overwhelming for me to compare every category to one another. Categories like nouns will have big Ns, but the verbs, adjectives and other categories are broken down into lots of subsections that I want to aggregate together. To do this I used the lexeme_reference.txt found on Duolingo’s GitHub and edited it to make simpler categories. All of the categories were distilled into Nouns, Verbs, Function words (like the, on, with etc.) and describer, which is a category I just made up to cover things like adjectives and adverbs.
lexref = read.csv('lexeme_reference.csv')
#Add simplePos based on the POS from the lexeme reference guide
data.reduced$simplePos = with(lexref, Type[match(data.reduced$pos,pos)])
data.reduced = data.reduced[complete.cases(data.reduced),]
#Remove intricate part of speech variable
data.reduced$pos = NULL
#Remove "other" items
data.reduced = data.reduced[data.reduced$simplePos != "other",]
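The match() lookup used here (and for the translations earlier) can look cryptic, so here is a toy version. The four-row reference table and its Type labels are assumptions made up for this sketch; the real lexeme_reference.csv is my edited copy of Duolingo’s file and has many more rows.

```r
# Invented mini reference table: detailed tag -> simple category
lexref_toy <- data.frame(
  pos  = c("n", "vblex", "det", "adj"),
  Type = c("noun", "verb", "function", "describer"),
  stringsAsFactors = FALSE
)

# Tags as they might come out of the lexeme strings; "xyz" is deliberately unknown
pos <- c("vblex", "det", "n", "xyz")

# match() returns the row of lexref_toy where each tag is found (NA if absent),
# and indexing Type by those rows gives the simple category
simplePos <- lexref_toy$Type[match(pos, lexref_toy$pos)]
print(simplePos)
```

Tags that are missing from the reference table come back as NA, which is why a complete.cases() filter right after the lookup drops them.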
Add Number of Modifiers variable
Next, I want to know how complicated a word is. Surely a word with more modifiers should be more difficult (e.g., a basic noun should be easier to get correct than a noun that has a bunch of endings denoting feminine, plural, accusative case, etc. (Yeah, I’m looking at you, German.)) To do that I’ll simply add a variable that counts the number of < characters, as each modifier adds a left bracket. While it’s true that all lexeme_string values have at least one modifier, it shouldn’t matter so long as we consider this value only relatively.
#Add number of modifiers
data.reduced$NoMod = str_count(data.reduced$lexeme_string,pattern = "<")
#Peek again
hist(data.reduced$NoMod)
So the distribution looks pretty reasonable. There are some outliers with ~14 modifiers, but there are really very few of them so we’ll let them be.
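As a quick sanity check on what that count captures, here is the same idea in base R on three strings from the data preview (with the surface form already stripped, as it is at this point in the cleaning). This is only a sketch; the analysis uses stringr’s str_count.

```r
lexemes <- c("/mann<n><m><sg><nom>",
             "/lernen<vblex><pri><p3><sg>",
             "/die<det><def><f><sg><nom>")

# Count the literal "<" characters in each string
n_mod <- lengths(regmatches(lexemes, gregexpr("<", lexemes, fixed = TRUE)))
print(n_mod)
```

The count includes the part of speech tag itself, so every word gets at least 1 and the number of modifiers beyond the part of speech is one less than this value.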
Look at the data
Ok whew. We’ve reduced, added and altered our data down to 16K observations of 13 variables. Now let’s have some fun!
So our first question was to see which languages are harder or easier. The simplest, easiest, grayest way to do that is like this:
#create summary statistics table
sumstat = summarySE(data.reduced, measurevar = "total_recall", groupvars = c("learning_language"))
#Generate simple bar graph
ggplot(sumstat, aes(x=learning_language, y=total_recall, fill = learning_language)) +
  geom_bar(stat="identity", position=position_dodge(), size=.3, aes(fill = learning_language))
So they look pretty similar like this. There’s a hint that Italian and Portuguese are easier than the rest, and that French is the hardest.
Fortunately we can take a more nuanced approach by using the variables that go into total_recall, namely total_seen and total_correct.
ggplot(data.reduced, aes(x = total_seen, y=total_correct, color = learning_language))+
  geom_point()+
  geom_smooth(method = lm,aes(color = learning_language))+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")+
  scale_x_continuous(limits = c(0,100))+
  scale_y_continuous(limits = c(0,100))
Now we’re cooking. First, note that the dashed line represents a perfect score. It means they’ve gotten the word right every time it has been shown. Anyway, based on this linear regression, French is definitely the hardest language, insofar as it takes more instances of seeing a certain word before reaching the same number of correct productions as one of the other languages. For example, just by looking at the regression lines, we can see that for a word to be produced 30 times correctly in French, a user needs to have seen it ~60 times, but for 30 correct productions in German, a word needs to be seen ~40 times.
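Reading those numbers off a plot by eye is a bit rough, so here is a sketch of how one could compute them from a fitted line instead. The data below are synthetic, generated with made-up slopes for two hypothetical languages "A" and "B"; they are not the Duolingo numbers, and seen_needed is a helper name I invented.

```r
# Synthetic points: language A gains 0.5 correct per view, language B 0.75
set.seed(1)
seen <- rep(seq(5, 95, by = 10), 2)
lang <- rep(c("A", "B"), each = 10)
true_slope <- ifelse(lang == "A", 0.5, 0.75)
correct <- true_slope * seen + rnorm(length(seen), sd = 1)
toy <- data.frame(seen, correct, lang, stringsAsFactors = FALSE)

# Fit correct ~ seen for one language, then invert the line:
# seen = (target - intercept) / slope
seen_needed <- function(d, target = 30) {
  fit <- lm(correct ~ seen, data = d)
  unname((target - coef(fit)[1]) / coef(fit)[2])
}

needed <- sapply(split(toy, toy$lang), seen_needed)
print(round(needed))
```

The slope of each line is roughly the number of correct answers gained per exposure, so a flatter line means more exposures are needed to reach the same number of correct productions.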
Now our second question was about what kinds of words are harder or easier to learn. And for that we will use a nearly identical graph, but this time we separate by our simple parts of speech metric.
ggplot(data.reduced, aes(x = total_seen, y=total_correct, color = simplePos))+
  geom_point()+
  geom_smooth(method = lm, aes(color = simplePos))+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")+
  scale_x_continuous(limits = c(0,100))+
  scale_y_continuous(limits = c(0,100))+
  scale_color_brewer(palette = "Dark2")
Here we see a relatively simple pattern that verbs are most difficult to produce correctly, with nouns and describers coming second, and the function category being relatively easier.
I wonder if this varies between languages, since the French words seem to comprise most of the nouns and verbs that are harder.
ggplot(data.reduced, aes(x = total_seen, y=total_correct, color = learning_language, shape = simplePos))+
  geom_point(aes(shape = simplePos))+
  geom_smooth(method = lm,aes(color = learning_language))+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")+
  scale_x_continuous(limits = c(0,80))+
  scale_y_continuous(limits = c(0,80))+
  facet_wrap(~learning_language, ncol = 5)