Published Dec 05, 2020
It's been a little while since my Outreachy internship with Wikimedia started, and I am already learning a lot! In this blog, I will share what my project is all about, along with a few things I have learned or re-learned along the way - many of them common technologies that other open-source communities use as well.
This post is going to be quite long, as I will keep adding my progress as I go through the internship. Parts of it will be useful for people looking for an internship who want to get familiar with what open source encompasses, or simply for people looking to join the open-source fun!
It also covers some things worth knowing in general as a software engineer, or to get into open source.
In short, the beginning of a huge change. We all know about Wikipedia and how it has been helping millions of people around the world get access to free and open knowledge, and more importantly how anyone around the world can contribute to this endeavour through adding, correcting, or updating information in the wikis. But did you know Wikipedia is a project under the Wikimedia Foundation?
The Wikimedia Foundation Inc. is the parent organization of free-content projects, most notably Wikipedia, the award-winning online encyclopedia. The mission of the Wikimedia Foundation is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally.
Other projects include Wiktionary, Wikibooks, Wikinews, MediaWiki (the software that runs it all), Wikidata, Wikimedia Commons, etc.
Wikipedia is free and the largest encyclopedia, hosted in lots of different languages. The way it roughly works at the moment is that each language edition creates and maintains its content separately.
This is where Abstract Wikipedia comes in.
What if we did not store information in each and every language separately? What if we had a notational way to store data in a 'single' place? All languages could use this data and generate wiki pages in their respective languages. This can increase the richness of so many language wikis, maintain consistently updated information, and most importantly pool the efforts of people of all languages into one platform, making knowledge gathering and dissemination across languages a breeze.
This is the Abstract Wikipedia project. Super cool! To know about it in more detail, see this.
The goal is to allow everyone to contribute and read content in Abstract Wikipedia, no matter what languages they understand, and to allow people with different languages to work on and maintain the same content.
As I mentioned above, Abstract Wikipedia will require a knowledge base and a way to turn that knowledge into any natural language. The knowledge base is called Wikidata, and the language generation happens through Wikilambda.
Wikilambda is a wiki of functions. For Abstract Wikipedia, we will need functions that take abstract content as the input and return natural language text as the output. Wikilambda aims at making access to functions and writing and contributing new functions as simple as contributing to Wikipedia.
See more details in the paper I talked about previously. But basically Wikilambda will store information in a structural form, indicating how the various things in a sentence are related. For example, 'San Francisco is the fourth-most populous city in California, after Los Angeles, San Diego and San Jose'. We can express it as a function:
Article(
  content: [
    Ranking(
      subject: San Francisco (Q62),
      rank: 4,
      object: city (Q515),
      by: population (Q1613416),
      local_constraint: California (Q99),
      after: [Los Angeles (Q65),
              San Diego (Q16552),
              San Jose (Q16553)]
    )
  ]
)
Here each name/term is denoted by an id (like Q99) and thus can be identified across languages. Now if I were to extract this information in Bangla, I would call this function and ask for an output in Bangla - or any other language for that matter. But how can a machine know how to generate language from words and their relations?
Did I mention that the key-values for the ids (like Q99) are stored in Wikidata? Wikidata also holds information about languages (lexicographic knowledge) so that sentences can be rendered in any language from the Wikilambda functions. Wikidata is already used in Wikipedia to some extent - to create infoboxes or extract certain pieces of information - but not to generate language, yet.
The way Wikidata can be populated with language info is through community effort. Various communities will have to provide lexicographic knowledge of their respective languages, ultimately creating a rich ecosystem of information flow. Notice that the amount of explicit language information required for Abstract Wikipedia will be much, much less than the amount we have to build up in the current Wikipedia, i.e. separate articles on every topic in every language.
This is not to say that a sudden breaking change will occur when Abstract Wikipedia comes into play. In fact, Abstract Wikipedia will join the existing multilingual Wikipedia and build on it enormously and swiftly.
Phabricator T263678
Analyze community authored functions that build Wikipedia infoboxes and more
Finding the different community-authored functions that are out there and helping prioritize which ones would be good candidates for centralizing for Abstract Wikipedia and its centralized wiki of functions.
Tasks:
Find me here:
And our work here:
To set up my existence in Wikimedia I had to follow a few steps. This section is not a guide for anyone to follow, but a mere report of what I have been doing. One of the best things I find in Wikimedia is the organization and the instructions. There are instructions for everything. For anything I could conceivably need to start contributing to Wikimedia, they have a wiki of instructions and troubleshooting. This just makes life so much easier.
First is to create a bunch of accounts. This seemed a little confusing at first, but various domains on Wikimedia require separate account creation and thus separate user pages. User pages are those that end in User:your_name. If you create a user page on meta.wikimedia.org, it will show up in en.wikipedia.org. I did create one on mediawiki.org separately to post updates about the internship. My user pages are:
Next I created a phabricator account, and connected it to both my wikimedia and wikitech accounts. Guidelines here. We had to set up 2-factor authentication here as well. But phabricator does not provide any sort of scratch codes, so losing your phone is going to be a bummer. What you do is set a committed identity in your wiki user page, so that if you get locked out of phabricator you can be recognized through your committed identity hash. Here is some more info.

All you need to know about git and gerrit for Wikimedia is already covered in their wikis. Here's the gerrit starter. All links from this page are important, especially the How to become a Wikimedia hacker page. Follow the instructions to set up and use gerrit. Keep note of some of the frequently required commands, like the command to push changes to Gerrit: git review -R or git review <branch name>, etc. Finally, set up docker and run the unit tests.
Get to know a little more about ssh. Here's how to create an ssh key pair and use it in gerrit. But for general-purpose usage, see some more. Here's another example of creating an ssh key pair and using it in toolforge.
IRC or Internet Relay Chat is a form of communication through pure text and is popular among open-source communities. There are various IRC clients, either on your PC or online. You can connect to various servers, and to channels within those servers, as you wish. To create a permanent account you will need to set up a password so no one else can use your name. Follow the guide to learn how to set it up. Note that all chat history is available as text files. You can see chat history on screen only after you have joined IRC.
We used the freenode server (chat.freenode.net:6697). I used irccloud, a web-based client, to connect to various channels of this server. The web option does not stay connected on the free version, so I installed Polari, a desktop app for the same purpose. I created an account by simply setting a password so no one can impersonate me. See instructions to use IRC here and more about it here. List of Wikimedia IRC channels here.
Wikimedia has a LOT of mailing lists. I subscribed to some of those which concern my work there, like abstract-wikimedia, research and wikitech. The list of all mailing lists is here. Anyone can join these lists to stay updated on issues of their interest. I basically get e-mail updates and can also send information to everyone on a list through it.
I have divided these materials into things about Wikimedia and things to know to join Wikimedia.
Toolforge: Created a toolforge account (logged in with my wikitech account) and followed the quickstart guide. Had to go several links deep to create the account, get familiar with how toolforge works and how to get started with it. See VC in toolforge and see Toolforge from GitHub after you have cloned your Git repo into your tool.
How to work with toolforge:

Log in and switch to your tool:

ssh -i ~/.ssh/id_rsa <unix shell username>@login.toolforge.org
become MY_TOOL
exit   # to get out

Bryan Davis (to tanny411): you can only connect with ssh/scp as your user, not as a tool user. So you would use your personal username, but could get files in/out of a tool by targeting the tool's $HOME with the location (meaning something like `scp bd808@login.toolforge.org:~tools.my_tool_name/some/file/path .`)

Submit jobs to the grid with jstart -N jobname venv/bin/python3 my_script.py, jsub [...options] python3 test.py or jsub script.sh. For more options see the sun-grid docs. The script.sh can activate the python env and then call the python file with arguments easily.

For crontabs see this. Basic usage:
crontab -e
# at the end of this file add the following to run at 8AM UTC every day
0 8 * * * test-script.sh
Database: there are lots of ways to access the databases - from Toolforge, PAWS, Quarry, etc. Connect from toolforge by sql enwiki_p, for example. List the databases with SHOW databases; or here. The one we need is called mediawikiwiki_p. Note that the user_properties, interwiki and text tables are not available as is.

To create a user database:

become my_tool
sql tools
## find your credential by:
SELECT SUBSTRING_INDEX(CURRENT_USER(), '@', 1);
CREATE DATABASE CREDENTIALUSER__DBNAME;

See Working with Wiki replicas databases and datasets.

Connecting to the database from your local computer: basically you set up a relay port on your PC, and anything you send to it is forwarded on to the wiki databases.
## copy over the replica.my.cnf file to your machine
## for each db connection you want, add another -L $LOCAL_PORT:$REMOTE_HOST:$REMOTE_PORT to this command
## using different local ports (e.g. 4711, 3308 etc)
ssh -N yourusername@login.toolforge.org -L 4711:enwiki.analytics.db.svc.eqiad.wmflabs:3306
mysql --user=$USER_FROM_REPLICA.MY.CNF --host=127.0.0.1 --port=4711 --password enwiki_p
Example use of database through toolforge using python:
## my terminal
ssh -i ~/.ssh/id_rsa tanny411@login.toolforge.org
## toolforge terminal
become test-tool-aisha
## my tools terminal
python3 -mvenv my_venv
source my_venv/bin/activate
## from venv
pip3 install --upgrade pip
pip install toolforge
python3
## code in python3
import toolforge
conn = toolforge.connect('mediawikiwiki_p')
with conn.cursor() as cur:
cur.execute("SELECT * from user_groups LIMIT 2;")
for x in cur:
print(x)
To create a user database you have to connect to the tools database and then create the database. You may want to have multiple connections in your code: for our purposes we will have to collect data from the API and from mediawikiwiki_p, and store the results in our user database.
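A rough sketch of juggling both connections at once (the ToolsDB host and the <user>__data naming follow the local-setup notes further down; the table name and credential user here are hypothetical):

import os
import pymysql
import toolforge

## read from the wiki replica...
replica = toolforge.connect('mediawikiwiki_p')

## ...and write into the tool's own database on ToolsDB
userdb = pymysql.connect(
    host='tools.db.svc.eqiad.wmflabs',
    read_default_file=os.path.expanduser('~/replica.my.cnf'),  # tool credentials live here
    db='s12345__data',  # hypothetical CREDENTIALUSER__DBNAME
)

with replica.cursor() as rcur, userdb.cursor() as wcur:
    rcur.execute("SELECT page_id, page_title FROM page WHERE page_namespace = 828 LIMIT 10")
    ## 'modules' is a hypothetical table in the user database
    wcur.executemany("INSERT INTO modules (page_id, page_title) VALUES (%s, %s)", rcur.fetchall())
userdb.commit()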
Use the Action API to fetch all Scribunto modules for all the wikis, in parallel, using the grid. Update the fetched data regularly. Enhance the API queries with queries against the database replicas. Page content (what we really want) has to be extracted with the API, as it's not available in the database.
You can maybe look at this in the other direction, though. It might be easier to do SQL queries via script to determine more reliably without API pagination the full list of modules to obtain, and then utilize the API or action=raw access to get everything you need for the actual Lua. It’s also completely fine to just use the Action API for all of it!
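A hedged sketch of that workflow for a single wiki - listing modules straight from the replica's page table and pulling one module's Lua source through the Action API (the User-Agent string is a placeholder):

import requests
import toolforge

conn = toolforge.connect('mediawikiwiki_p')  # replica of www.mediawiki.org

## list every Scribunto module (namespace 828) via SQL - no API pagination needed
with conn.cursor() as cur:
    cur.execute("SELECT page_title FROM page WHERE page_namespace = 828 AND page_is_redirect = 0")
    titles = [row[0].decode('utf-8') for row in cur.fetchall()]

## then pull the Lua source of a module through the Action API
def fetch_module_text(title):
    params = {
        'action': 'query', 'prop': 'revisions', 'rvprop': 'content',
        'rvslots': 'main', 'titles': 'Module:' + title,
        'format': 'json', 'formatversion': 2,
    }
    r = requests.get('https://www.mediawiki.org/w/api.php', params=params,
                     headers={'User-Agent': 'module-survey-example'})  # placeholder UA
    page = r.json()['query']['pages'][0]
    return page['revisions'][0]['slots']['main']['content']

print(fetch_module_text(titles[0])[:200])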
Then we are to find the relationships between modules, data about module usage, and which are the most important modules; find and remove redundancy if possible and centralize similar code. These are among the first steps towards making Wikipedia functions language-independent, towards the goal of Abstract Wikipedia. For example, see a function in English and Bangla. See more languages from the language section of the left tab.
All Scribunto modules are in the Module namespace (namespace number 828). To see where a module is used, check 'what links here' on the left tab. 'Pages transcluded onto the current version of this page' lists the modules used in a page even indirectly, for example through a template.
Some learning points from when I tried to run and check huge amounts of data. I had to make very careful decisions about how to write my code, especially in pandas when loading csv. Never mind the wiki contents (~1.8G!) - even a file with just the page_id and wiki link columns wouldn't load (15M + 18M rows); that's how many pages there are across all wikis! I had to:

- del dataframes when done.
- Write code such that no dataframe needed to be copied, even partially, for any operation.
- Use inplace options when possible.

Later, when I ran into issues due to bad data, I couldn't find any easy solution. I wanted to ignore bad data, but pandas only ignores rows when there are more columns than necessary, not fewer. Normally I would check the raw data to see what's going on and why the bad data was created in the first place. But large files! After much maneuvering through the pandas docs, I ended up simply doing a grep for the piece of data that was causing the error! Sometimes I wish easier solutions rang a bell faster in my head.
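In sketch form, those adjustments looked something like this (the file and column names are made up; the column selection is just an illustration of keeping only what's needed):

import pandas as pd

def process(df):
    print(len(df))  # stand-in for the real work

## load only the columns needed, mutate in place, and free the frame promptly
df = pd.read_csv('all_pages.csv', usecols=['page_id', 'wiki'])
df.drop_duplicates(inplace=True)  # no extra copy created
process(df)
del df  # release the memory before loading the next file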
Eventually, csv files had to be abandoned. Loading from csv created issues due to file size, unexpected symbols (commas, quotes) in the contents, etc. All of these and more can be solved by creating a user db in toolsdb. To this end, all file reads and writes were transitioned to databases. Work progress is in the database-refactor and content-fetcher-database branches. The code was cleaned up so that revisions can also be done with the same script: for example, when the latest page content is already available its write is skipped, otherwise it is written. That would have been very hard to accomplish if things were written to files.

To run large processes we use the Grid and set cronjobs for regular runs. To test with a small amount of data we can run python scripts in toolforge (to get database access). But testing and checking code becomes a pain if every small change has to be made locally, then transferred to toolforge and then run. The solution is to set up database access locally. Follow the database_setup.md file to set up port forwarding from your PC to the toolforge database.
ssh -N <username>@dev.toolforge.org -L 1147:tools.db.svc.eqiad.wmflabs:3306
Here username is your toolforge username and replica_user is your tool. See your tool's username and password in the replica.my.cnf file. Make sure to create a connection for each db you are going to access, with different ports (1147, 1148 etc). Multiple connections can also be set in one command. See toolforge:Database for more.
Now connect to it from the terminal with
mysql --user=<replica_user> --host=127.0.0.1 --port=1147 --password <replica_user>__data
or with python (to run your scripts locally)
conn = pymysql.connect(host='localhost', port = 1147, user='<replica_user>', db='<replica_user>__data', password='<replica_pass>')
Dataframes are written to the database with pymysql, as with the toolforge library. This requires us to iterate through the dataframe and insert every row separately. An alternative and faster way is to use sqlalchemy and df.to_sql(). See this, this and the pandas docs for more. But we need to consider what will happen when multiple crons do bulk inserts - maybe not a good idea, as the db may remain locked for too long.

Some frequent commands:
## update refs
git fetch --all
## force pull (sometimes you need to stage everything before this)
git reset --hard origin/<branch_name>
## checkout and create from remote branch
git fetch --all
git checkout --track origin/<branch_name>
## stash (a conflict on pop keeps the stash around)
git stash
git stash pop # applies latest
git stash list
git stash apply stash@{stash_number}
## Mistakenly commit (maybe also push) something?
git reset --soft HEAD~2 # goes back 2 commits
## then commit and force push if you've pushed before
## See branch infos
git branch # see list of branches
git remote show origin # see which branch is tracking which
## List ssh ports being used
ps aux | grep ssh
## list all cronjobs full values (cuz names get truncated sometimes)
qstat -xml | tr '\n' ' ' | sed 's#<job_list[^>]*>#\n#g' | sed 's#<[^>]*>##g' | grep " " | column -t
Database exploration. Work progress here. I started out by trying to understand what the data in the tables represents and how it might be useful for us. There were lots of tables and it took some time to figure out the differences between the pagelinks, langlinks, templatelinks and iwlinks tables. I explored the source of wiki pages to find how modules are used (by #invoke), which is different from linking a module - a link just means the module is being talked about, either in talk pages or in places that link to the module page for reference. I found that modules are mostly used in template pages, and that the list of transclusions of any wiki page lists all the pages used in it, either directly or indirectly. That saved us some work, as I could simply query the templatelinks table to find the places a module is being used.
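The core of that query is small - a rough sketch against the 2020 templatelinks schema (tl_namespace/tl_title), with a made-up module name:

import toolforge

conn = toolforge.connect('mediawikiwiki_p')
module = 'Citation'  # hypothetical module name, without the Module: prefix

with conn.cursor() as cur:
    ## every page that transcludes Module:<module>, directly or through a template
    cur.execute(
        """
        SELECT p.page_namespace, p.page_title
        FROM templatelinks tl
        JOIN page p ON p.page_id = tl.tl_from
        WHERE tl.tl_namespace = 828 AND tl.tl_title = %s
        """,
        (module,),
    )
    for ns, title in cur.fetchall():
        print(ns, title.decode('utf-8'))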
Completed my work on collecting relevant data from the databases. My initial hunch was that collecting info from databases would be fast because:
Learning points and struggles:
- The replica databases do not support the WITH clause, so in some places I had to repeat a sub-query within a query, where the sub-query itself was quite expensive (joining 4 tables!).
- To investigate transclusions and pagelinks, here's what I did: went through what links here.
- I learned about interwiki prefixes and how they work. Sadly the interwiki table is empty in the replicas, so I created a script to fetch and populate this table in the user database through the API. I landed on this mapping along the way.
- The chunk_size parameter of read_sql should give me a generator, which I thought would fetch data from the db in chunks so I could save them as they arrive. But it seems pandas sometimes tries to load the entire data into memory and gives us chunks from there! I mean seriously, what's the point? See this issue. I initially did not know this and went crazy over why I was getting the errors. Note that I did not get explicit memory errors, I got segmentation fault (core dumped). So I was also confused about whether it was a memory error at all, although I was pretty sure it was. Finally I used LIMIT ... OFFSET in SQL instead of the dataframe chunk size - slow, but it works. Time-memory tradeoff. (A rough sketch of this workaround follows after this list.)
- If a function contains the yield keyword, that function becomes a generator, and so it failed to do the other things it was supposed to. The worst part was that it wouldn't raise any errors, so I was banging my head on the table trying to figure out why my function wasn't even being called! I used the vscode debugger and what not.
- I learned about eval() in python, with which I could send in the name of a function as a string and call it like eval(func_name)(**params).
- Keep conn.close() statements in the finally block of try-except.
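Here is roughly what the LIMIT/OFFSET workaround from the list above looks like - a minimal sketch rather than my actual script; the query, columns and chunk size are placeholders:

import pandas as pd
import toolforge

def fetch_in_chunks(dbname, query, columns, chunk=10000):
    ## pull rows in fixed-size batches instead of trusting read_sql's chunksize
    conn = toolforge.connect(dbname)
    offset = 0
    try:
        while True:
            with conn.cursor() as cur:
                cur.execute(f"{query} LIMIT {chunk} OFFSET {offset}")
                rows = cur.fetchall()
            if not rows:
                break
            yield pd.DataFrame(rows, columns=columns)
            offset += chunk
    finally:
        conn.close()  # always close, even if the caller stops early

for df in fetch_in_chunks('mediawikiwiki_p',
                          "SELECT page_id, page_title FROM page WHERE page_namespace = 828",
                          ['page_id', 'page_title']):
    print(len(df))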
Some additional work:
The most important information for us is
This information is yet to be explored through data analysis. So far the number of transclusions and the transcluded-in counts are being saved in the user database.
Finalized collecting data from the databases. I found that collecting data from the templatelinks table (~9 hours) and the iwlinks table (3+++ days) was taking too long. The query for collecting transclusion information from the templatelinks table was not too complex, but it was taking that long because the table itself is huge. On the other hand, the query to collect information about iwlinks was very complex and involved repeatedly copying a 'whole' table over to toolsdb. After discussion with my mentor we decided to stop that query for now, because it had been running for over 3 days and the data it provides may not be THAT important after all. We can collect it later if required. Adam also suggested that I drop the DISTINCT keyword where it doesn't change the data, as it might be causing a lot of delay in the query. In fact it was redundant, as the page_ids were distinct anyway, so I removed this keyword from a couple of queries, but it did not seem to speed them up like we wanted.
I was getting a bunch of MySQL Lost connection during query errors. I tried reducing the number of rows returned per query from 500 to 300 to 100; the problem reduced, but persisted. I googled a lot and found that it may be due either to memory overload (which I had already handled) or to a timeout (which seems to default to 8 hours). Finding no other clue, I decided to set up a retry system in my code, so that on losing the connection the code automatically retries at least 3 times before failing. Moreover, I also set up scripts to fetch the missed information after the other scripts are done. While setting this up I had to brainstorm how to save what got missed without falling into an infinite loop (when the script that fetches the missed stuff also misses something). During this I learned about efficient usage of the try-except-finally block, especially the finally block.
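The retry logic was roughly of this shape (a simplified sketch with made-up names; the real scripts also record whatever got missed so it can be re-fetched later):

import time
import pymysql
import toolforge

def query_with_retry(dbname, query, retries=3):
    ## retry a few times on lost connections before giving up for good
    for attempt in range(1, retries + 1):
        conn = toolforge.connect(dbname)
        try:
            with conn.cursor() as cur:
                cur.execute(query)
                return cur.fetchall()
        except pymysql.MySQLError as e:
            print(f"attempt {attempt} failed: {e}")
            if attempt == retries:
                raise  # let the caller record this query as 'missed'
            time.sleep(5 * attempt)  # back off a little more each time
        finally:
            conn.close()  # always release the connection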
The next job was to get pageviews. That was a new hurdle on its own. The API I have been using so far serves only the last 60 days of pageviews. So I looked into the REST API (which is what the previous API uses anyway). This one has data from July 2015! See the docs here. But unfortunately I could not get it to work, because calling it too much too quickly gives timeouts and lots of 404s. We can visualize pageviews using the Pageviews tool though. BUT, the last 60 days from the REST API and the PHP API don't match - probably due to heavy caching of the PHP API. I also ran into some errors using the PHP API; Jade helped me figure out that it was an issue with the python toolforge library when setting user_agent (a very unexpected place to cause errors). We reported it and they corrected it right away.
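For reference, fetching daily views for one page from the REST API looks something like this (a sketch: the page, dates and User-Agent are placeholders, and in practice the requests need heavy throttling):

import requests

def get_pageviews(project, article, start, end):
    url = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
           f'{project}/all-access/all-agents/{article}/daily/{start}/{end}')
    r = requests.get(url, headers={'User-Agent': 'pageviews-example'})  # placeholder UA
    r.raise_for_status()
    ## one item per day, e.g. {'timestamp': '2020010100', 'views': 1234, ...}
    return {item['timestamp']: item['views'] for item in r.json()['items']}

print(get_pageviews('en.wikipedia.org', 'Module:String', '20200101', '20200131'))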
Learning Points: Some additional things I got to learn along the way in week 6 are:
- Case-sensitive matching in the WHERE clause. LIKE seems to do case-sensitive searches, but its results were not consistent between my scripts and the terminal for some reason. I searched and learnt about collations in MySQL tables, but could not accomplish much with that information. At last I stumbled upon the simplest solution - WHERE col1 = BINARY col2. Notice that this performs much faster than BINARY col1 = col2.

On week 7, I started data analysis of the various numeric data from the user table. The first impression is that there are lots of nulls, which I analysed by looking into the respective wiki database tables; I concluded that certain columns in our user database need to be modified to have a default value of 0. The next most important observation is that the data is HIGHLY skewed. The only way to visualize anything is to plot in log scale, which makes it hard to internalize the real extent of the data. So I viewed the data in small intervals, and for each column I tried to find some very basic initial heuristic to identify which modules are important - for example, modules with more than 1M transclusions are probably important, because they are outliers. I wanted to do some interactive plotting so I could zoom in and view whatever I want, but with this amount of data it's too heavy and does not work :( . My data analysis is still going on, so more updates can be found next week.
As for pageviews, the script is still running to fetch pageviews for all the data for 60 days. Since it takes multiple days to run, I changed the script to fetch weekly instead of daily for later runs. But I think I might have to change it to fetch monthly instead! Adam suggested getting pageviews from the dumps, which have pageviews for more than 60 days and might be easier to work with as it is one big file. But it might be time consuming and may not be 100% accurate depending on how I parse it. For now, since we can work with 60 days of data pretty well, I am holding off on parsing the dumps. After the data analysis is over, if I see that pageviews from the API aren't working, I will start looking into the dumps.
On week 7 I also did some error removal on my fetch-data-from-database script. When all the cronjobs ran, I got some deadlock and lost connection errors. To fix this I set up my code to loop through a few retries before throwing errors. I optimized my cronjobs some more and set up the final runs for the weekend. But it seemed some cronjobs that should take only an hour or so were taking the whole weekend to run - it seemed fishy. So I stopped them and saw that my code was retrying forever! The deadlock error was resolved, but lost connection was not going away even after retrying SO MANY times. Anyways, I debugged my code for half a day and couldn't figure out why the error was not getting caught. I asked Jade to look into it and she ran it with a simple piece of code and said the errors were actually being caught! So the problem was with pd.read_sql, which was eating up the error and just printing the error message as a string! I searched a lot to find how to catch errors from pandas, but it seems that does not work for pymysql. So I decided to not use read_sql at all. I used a normal cur.execute() and then saved the data as a dataframe (had to make multiple adjustments, like sending the column list into each function). Finally, I also changed my error handling to have nested try-except blocks, and that finally fixed the retry logic I was trying to build. The database still gives lost connection on 2 wikis (although I increased the timeout and tested with returning just 1 row!).
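The read_sql replacement boiled down to something like this (a simplified sketch; the query and column list are placeholders):

import pandas as pd
import toolforge

def run_query(dbname, query, columns):
    ## cur.execute() raises pymysql errors properly, unlike pd.read_sql,
    ## which was swallowing them and just printing the message
    conn = toolforge.connect(dbname)
    try:
        with conn.cursor() as cur:
            cur.execute(query)
            rows = cur.fetchall()
        return pd.DataFrame(rows, columns=columns)
    finally:
        conn.close()

df = run_query('mediawikiwiki_p',
               "SELECT page_id, page_title FROM page WHERE page_namespace = 828 LIMIT 5",
               ['page_id', 'page_title'])
print(df)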
Week 8 Update:

- The mysql connection lost error does not occur in PAWS, and Jade tested it locally - it works there too! Created a bug report in Phab T272822 and asked in IRC.
- Users with X rights can edit/move pages that have X edit/move protection. Read up and tried to understand this more. Concluded that this isn't something universal across wikis, so I stuck to the old page protection values, as those are the majority.
- Used browser inspect to copy what I needed and pasted it in my markdown cell.

Week 9 Update:
- The reason for the mysql connection lost errors was that we were running on the web cluster instead of the analytics cluster, which is the default in the toolforge python library. Tested by running all the scripts; it seems to work now.

Week 10-11 Update:
Started applying the findings from the data analysis to find 'important' modules. As evident from my data analysis, the data we have is heavily skewed. We want pages with more transclusions, more page links, and so on to be counted as important modules. But the number of such modules is very low, and they get lost in the 99.99999…th percentile. To fix this, I changed the distribution slightly and regarded the percentile a value falls in as its score. See more details of how I did that in the phabricator task T272003. Finally, a module score is calculated as a weighted sum of the feature scores. The weights can be altered by the user, to prioritize, say, the number of lang links over transclusions.
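The scoring idea, in toy form (not the project code - the features, values and weights here are made up):

import pandas as pd

## toy feature table - in the real task these numbers come from the user database
df = pd.DataFrame({
    'transclusions': [5, 120, 3_000_000, 40, 900],
    'langlinks':     [0, 12, 180, 1, 7],
})

## score each feature by the percentile a value falls in (rank-based, so the
## heavy skew no longer hides everything below the top outliers)
percentile_scores = df.rank(pct=True)

## user-adjustable weights, e.g. favouring lang links over transclusions
weights = {'transclusions': 0.4, 'langlinks': 0.6}

df['module_score'] = sum(percentile_scores[col] * w for col, w in weights.items())
print(df.sort_values('module_score', ascending=False))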
Week 12-13 Update:
After our weekly discussion, Jade and I decided to split our tasks once again. I continued working on the similarity analysis that Jade had started, while Jade started building the web interface. Week 12 was a hectic week. I was buried in tons of experiments and had to come up with a way to find modules that are similar to each other. Jade had started by using edit distance (aka levenshtein distance) as features and performing DBSCAN clustering on it. The problems were:

- It takes n x n memory to create the distance matrix, which wouldn't be possible with our ~300k modules.

To fix this I started out fresh and looked for other ways to make features. After a lot of experiments (see details in phabricator task T270827) I decided to go with FastText word embeddings (size 32) as features and the OPTICS clustering algorithm. OPTICS is similar to DBSCAN but optimized for space (so we can run clustering for all the modules at once), unlike DBSCAN, which is optimized for speed. Next, I dealt with the high number of noise points that the algorithm detects by tuning the algorithm a bit, creating some pseudo-clusters from the noise, and finding ways to relate the clusters themselves. All of this is documented in a pdf uploaded to the phabricator task.
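To give a flavour of that pipeline, here is a toy sketch using gensim's FastText and scikit-learn's OPTICS on made-up module sources - not the actual experiment code:

import numpy as np
from gensim.models import FastText
from sklearn.cluster import OPTICS

## stand-in module sources; the real input is the fetched Lua code
module_sources = {
    'Module:A': 'local p = {} function p.hello() return "hi" end return p',
    'Module:B': 'local p = {} function p.hello() return "hello" end return p',
    'Module:C': 'local data = { 1, 2, 3 } return data',
}
tokenized = {name: src.split() for name, src in module_sources.items()}

## 32-dimensional FastText embeddings trained on the module tokens
## (the argument is vector_size in gensim 4.x, size in older releases)
model = FastText(sentences=list(tokenized.values()), vector_size=32,
                 window=5, min_count=1, epochs=10)

## represent each module by the average of its token vectors
names = list(tokenized)
X = np.array([np.mean([model.wv[tok] for tok in tokenized[n]], axis=0) for n in names])

## OPTICS is density-based like DBSCAN but avoids materializing an n x n distance matrix
labels = OPTICS(min_samples=2, metric='cosine').fit(X).labels_
for name, label in zip(names, labels):
    print(name, label)  # label -1 marks a module left as noise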