Mannheim Web Panel (MWP)

Site content by Julian Oliver Dörr and Sebastian Schmidt

This site introduces the Mannheim Web Panel (MWP) – a panel dataset of website contents extracted from corporate websites for a large sample of German companies.
The MWP was developed at the ZEW – Leibniz Centre for European Economic Research in the Project Business and Economic Research Data Center (BERD@BW).

Why corporate websites?

Company websites pose an important source of economic data used by firms to spread product and service information (related to establishing a public image), to conduct transactions (e-business processes) and to ease opinion sharing (electronic word-of-mouth) (Balzquez & Domenech, 2018). Recent economic studies have used corporate website data to:

Why panel data?

Firm characteristics, diffusion processes such as technological advances and technological adoption as well as business relations are clearly not static but evolve over time. It requires a continuous monitoring of corporate websites to capture this information. For this reason, the ZEW – Leibniz Centre for European Economic Research scrapes corporate website contents since 2018 and has established a panel format of these contents updated every three to six months.

Is the data available for researchers?

The MWP data is stored in ZEW’s cloud structure (Seafile) and can be accessed by externals for the purpose of research via ZEW’s Research Data Centre (ZEW-FDZ). Upon signing a licence agreement, interested users will have full access to the MWP in a secured environment provided by ZEW. For more details please contact:

How is the data structured?

The MWP includes a large amount of web data from corporate websites of German companies. The general scraping framework used to establish the MWP is available on Github. The scraping parameters for the MWP are standardized. For each website, the first 50 subpages are downloaded with shorter URLs in the corporate web-domain scraped more likely. Webpages which are in German language are preferred in the scraping process such that the majority of text content in the MWP is in German.

In the following, we will describe the data structure and access to the panel in more detail using Python. Note that the data can be accessed by any other programming language as well. The Seafile client that allows you to access the MWP from your local machine corresponds to “Q:\” drive in the following. Set the relevant drive on your machine in the cell below:

drive = "Q:\\"

In total, the MWP includes approximately 740 GB of data (as of April 2021) and information on more than 2.7 million firms. However, only a part of these firms is included in each wave. Thus, the MWP comes with a file showing for each firm whether it had been scraped successfully in the respective wave (in the table below 1 indicates a sucessful scraping attempt of the respective corporate website). In general, web data is available for the following dates:

  • December 2018
  • April 2019
  • August 2019
  • December 2019
  • March 2020
  • May 2020
  • August 2020
  • October 2020
  • January 2021
import pandas as pd
overview = pd.read_csv(drive + r"Für mich freigegeben\Mannheimer Webpanel (MWP)\Übersicht.csv", delimiter='\t')
overview.head(5)
IDurl2018_122019_042019_082019_122020_032020_052020_082020_102021_01
02010000001www.wiener-conditorei.de111111111
12010000057www.hoernicke.de111111111
22010000074www.psi.de111111100
32010000125www.scherdel.de111111101
42010000144www.kleinschmidt.de111111100

5 rows × 13 columns

Each wave is stored within a corresponding directory in Seafile and is split up in multiple chunks (generally about 100-150 MB each). These chunks are csv-files containing the scraped webiste information including:

  • ID: ID of the company
  • dl_rank: chronological order the webpage was downloaded (main page = 0)
  • dl_slot: domain name of the website
  • alias: domain of website if there was an initial redirect (e.g. from www.example.de to www.example.com)
  • error: indicating if there was an error when requesting the website’s main page (e.g. HTML error, timeout)
  • redirect: equals “True” if there was a redirect to another domain when requesting the first webpage from a website
  • start_page: first webpage that was scraped
  • title: title of the website as indicated in the website’s meta data
  • keywords: keywords as indicated in the website’s meta data
  • description: short description of the website as indicated in the website’s meta data
  • language: language of the website as indicated in the website’s meta data
  • text: text that was downloaded from the webpage, including the respective HTML tags
  • links: domains of websites (within-sample and out-of-sample) found on the focal website
  • timestamp: exact time when the webpage was downloaded
  • url: URL of the requested webpage

For some of the older waves, information on title, keywords, description and language are not included.

How can the data be accessed?

The files can be accessed for further analysis e.g. by using pandas. For this, the following modules are necessary.

import pandas as pd
import os
import glob
How can singular files be accessed?

Singular files can be accessed by path, as in a regular folder structure. Their names always follow the same patterns (ARGUS_chunk_p*).

# Define which wave you want to access
wave = "2020_03"
singular_file = pd.read_csv(drive + r"Für mich freigegeben\Mannheimer Webpanel (MWP)\\" + wave + r"\ARGUS_chunk_p12.csv", delimiter='\t')
singular_file.head(5)
IDdl_rankdl_sloterrorredirectdescriptiontexttimestampurllinks
020111406550juwelo.tvNoneTrueEdelsteinschmuck in allen Variationen günstig …Ihr Experte für zertifizierten Edelsteinschmuc…Tue Mar 17 18:18:25 2020https://www.juwelo.de/juwelo.fr,juwelo.nl,juwelo.es,juwelo.it,juwelo…
120111406551juwelo.tvNoneTrueLiebhaber eleganter Schmucktrends sollten sich…Ihr Experte für zertifizierten Edelsteinschmuc…Tue Mar 17 18:18:25 2020https://www.juwelo.de/tpc/juwelo.fr,juwelo.nl,juwelo.es,juwelo.it,juwelo…
220111406552juwelo.tvNoneTrueJuwelo – AGB. Schmuckpräsentation und Herstell…Ihr Experte für zertifizierten Edelsteinschmuc…Tue Mar 17 18:18:25 2020https://www.juwelo.de/agb/juwelo.fr,juwelo.nl,juwelo.es,juwelo.it,juwelo…
320111406553juwelo.tvNoneTrueRinge besetzt mit wunderschönen Edelsteinen, v…Ihr Experte für zertifizierten Edelsteinschmuc…Tue Mar 17 18:18:25 2020https://www.juwelo.de/ringe/juwelo.fr,juwelo.nl,juwelo.es,juwelo.it,juwelo…
420111406554juwelo.tvNoneTrueSchmuck mit echten Edelsteinen von Dagen ist n…Ihr Experte für zertifizierten Edelsteinschmuc…Tue Mar 17 18:18:25 2020https://www.juwelo.de/dagen/juwelo.fr,juwelo.nl,juwelo.es,juwelo.it,juwelo…
How can an entire wave be accessed?

Data from an entire wave can be accessed by looping over the csv-files. Keep in mind that the files are very large which might cause problems with your memory. Therefore, it is sensible to filter only the data you need in the loop.

# input directory
input_dir = drive + r"Für mich freigegeben\Mannheimer Webpanel (MWP)\\" + wave
os.chdir(input_dir)

# get all files in input directory
input_files = sorted(glob.glob('ARGUS_chunk_*.csv'))    

# structure for exemplary code
tick = 0
output = pd.DataFrame(columns=['number', 'percentage_errors'])
error_count = []

# loop over chunks
for input_file in input_files:

    #########################################
    # Exemplary calculations based on the MWP
    #########################################

    # Search for the following search term
    search_term = "digital"

    tick += 1 
    if tick > 2:
        break

    # load chunk in chunks due to file size
    for data in pd.read_csv(input_file, chunksize=20000, sep = "\t", encoding='utf-8', error_bad_lines=False):
        data = data.loc[data['text'].notnull(),:]
        temp = set(data.loc[data['text'].str.contains(search_term), 'ID'].values)
    temp.update(temp)

print(temp)
{2011052555, 2011057167, 2011061267, 2011027478, 2011052569, 2011047456, 2011063331, 2011025957, 2011042854, 2011026988, 2011041328, 2011058738, 2011066931, 2011044404, 2011059767, 2011046456, 2011065399, 2011024957, 2011022407, 2011027528, 2011029065, 2011024458, 2011032141, 2011044944, 2011052113, 2011074132, 2011028054, 2011053145, 2011036769, 2011060324, 2011037285, 2011048551, 2011031664, 2011025524, 2011065460, 2011054717, 2011032191, 2011062922, 2011040399, 2011026066, 2011024534, 2011051159, 2011027609, 2011041434, 2011049641, 2011017900, 2011016884, 2011053238, 2011072700, 2011031742, 2011043006, 2011062467, 2011032260, 2011015882, 2011055309, 2011029199, 2011063504, 2011047633, 2011024084, 2011023572, 2011066072, 2011012323, 2011046115, 2011063526, 2011028715, 2011053291, 2011051253, 2011058950, 2011063047, 2011023116, 2011045137, 2011051289, 2011024668, 2011025693, 2011049244, 2011033888, 2011044129, 2011043618, 2011043106, 2011050273, 2011036453, 2011052846, 2011052849, 2011044673, 2011049798, 2011050827, 2011062107, 2011040097, 2011034981, 2011021157, 2011044710, 2011029352, 2011030889, 2011041646, 2011039087, 2011039599, 2011030898, 2011041141, 2011044213, 2010999159, 2011037566, 2011051910, 2011051399, 2011039626, 2011048332, 2011036047, 2011030934, 2011038616, 2011039129, 2011027866, 2011042723, 2011062695, 2011037611, 2011026860, 2011040173, 2011032494, 2011049389, 2011025844, 2011062204, 2011025346, 2011032003, 2011054536, 2011042258, 2011059155, 2011048405, 2011040219, 2011027945, 2011037677, 2011041782, 2011025916, 2011055615}

The above output shows company IDs on whose corporate website the search term digital matched. This shall just give a very high level example on how to work with the MWP.

Can the MWP be combined with other data?

Please note that it is possible to combine the MWP with further firm-level data hosted at ZEW, such as for example the Mannheim Innovation Panel (MIP) and the IAB/ZEW Start-up Panel. This is how the MWP realizes its full potential. Find out more on the website of the ZEW-FDZ.

The Mannheim Web Panel was developed as part of the project BERD@BW funded by the Ministry of Science, Research and the Arts of Baden-Württemberg from 2019 to 2022.