Exploratory analysis of demographic data (Vall de almonacid 1612 to 1852)

1. Introduction

This is an exploratory analysis of demographic data from Vall de Almonacid (https://commons.wikimedia.org/wiki/File:VGeneral_VallAlmonacid.jpg). Vall de Almonacid (Spain) is a small village on the Valencian region. Of Muslim origin, it was conquered, refounded and repopulated with Catholics from the Aragon Kingdom in 1238.

In the church there are several record books where births, marriages and funerals were recorded by the church prist. In the 1980s these records were taken by Josep Maria Perez Rodriguez who copied them by hand with the help of Jose Maria Perez Rodriguez, Maria Rovira Barbera, Pilar Perez Rodriguez, and some others.

The hand-copied files were later digitalised (by hand, again) and a copy of these files is the starting point of this analysis. Not all data was digitalised and many typos and errors are present in the dataset.

For this analysis, I opted for Python (v.3.5.2+) as a flexible yet user-friendly tool to explore data.

I have previosuly done some visualisation experiments with this data using d3.js that you can find at https://fuyas.github.io/nacimientos/ (in Spanish).

In addition, there is a small report study authored by Josep Maria Perez Rodriguez about this demographic data (http://perezrovira.net/adria/pdfs/icap_demografia_JM.pdf, in Spanish).

2. Reading the data

2.1 Imports and external dependencies

All modules are commonly used modules

In [1]:
# plot figures inline in the IPython notebook
%matplotlib inline

import csv
import numpy as np
import datetime
import matplotlib.pyplot as plt
from collections import Counter

# set width of figures by default
plt.rcParams['figure.figsize'] = (14.0, 6.0)

2.2 Reading the data

To import the data I created a small function that reads the CSV files (there are two, one for baptisms and one for funerals). The function returns the header (name of the fields stored on the first line of the CSV file), a numpy 2D array with all fields as strings, and a second copy with all fields capitalised. The second copy is used to compare strings, where capitalisation might differ due to the poor quality of data. In addition to read the data the function prints a list of all the fields, detailing the number of entries for each field and the value and frequency of the most common entry.

In [2]:
def readDB( str_file ):

    file  = open(str_file, "r")
    reader = csv.reader(file)

    rownum = 0
    allData = []
    ALLDATA = []

    for iRow, row in enumerate(reader):

        # Header row.
        if iRow == 0:
            header = row

        else:
            allData.append( row )
            ALLDATA.append( [x.upper() for x in row] )

    file.close()

    # put all data in  numpy
    npData = np.array(allData)
    NPDATA = np.array(ALLDATA)

    print ( "\nNumber of Entries: %4d" % npData.shape[0] )
    print ( "Number of Columns: %4d\n" % npData.shape[1] )

    # Check number of entries per column and the most common value
    print ( "                         # - Unique - Most common" )

    for idx in range(len(header)):
        vec  = NPDATA[:,idx]
        vec2 = vec [ NPDATA[:,idx] != '' ]

        if len(vec2) > 0:
            count = Counter(vec2)
            print ( "[%2s] %-15s: %4d - %6d - ( %4d | %s )" % (idx, header[idx], vec2.size, len(count), count.most_common(1)[0][1], count.most_common(1)[0][0]) )
        else:
            print ( "[%2s] %-15s: %4d" % (idx, header[idx], vec2.size) )

    return [header, npData, NPDATA]

2.3 An overview of the births dataset

In [3]:
    [headerBirth, npBirthData, NPBIRTHDATA] = readDB ('data/allNacimientos.csv')

Number of Entries: 5237
Number of Columns:   42

                         # - Unique - Most common
[ 0] NOMBRE         : 5237 -   2442 - (  241 | MANUEL )
[ 1] APELL1         : 5221 -    461 - (  381 | RODRIGUEZ )
[ 2] APELL2         : 5217 -    569 - (  305 | RODRIGUEZ )
[ 3] SEXO           : 5237 -      2 - ( 2722 | HOMBRE )
[ 4] NACIMIENTO     : 5237 -   4967 - (    4 | 03/07/1839 )
[ 5] DIA            : 5237 -    366 - (   26 | 83 )
[ 6] YEAR           : 5237 -    216 - (   53 | 1831 )
[ 7] YEAR_ID        : 5237 -     53 - (  216 | 6 )
[ 8] DEFUNCION      :    0
[ 9] BAUTISMO       : 1117 -   1092 - (    3 | 04/07/39 )
[10] ENTERRAMIE     :    0
[11] TESTIGDEF      :    0
[12] ALBACEDEF      :    0
[13] ORIGEN         :    8 -      6 - (    3 | PARVULO )
[14] VECINDAD       :    0
[15] ESTADCIVIL     :    0
[16] LIBRAS         :    0
[17] DINEROS        :    0
[18] NOMPADRE       : 5224 -    237 - (  702 | MANUEL )
[19] APELL2PAD      : 5222 -   1133 - (  167 | RODRIGUEZ )
[20] NOMMADRE       : 5210 -    282 - (  986 | MARIA )
[21] APELL2MAD      : 5217 -   1265 - (   96 | SANCHO )
[22] VECINDPAD      :    2 -      2 - (    1 | PASAGEROS POBRES )
[23] ORIGPAD        :   25 -     18 - (    4 | (1) )
[24] ORIGMAD        :   55 -     37 - (    4 | SEGORBE )
[25] OFICIOPAD      :  533 -     65 - (  381 | LABRADOR )
[26] OFICIOMAD      :    0
[27] NPADRINO       : 3720 -    248 - (  437 | MANUEL )
[28] NMADRINA       : 3794 -    309 - (  454 | MARIA )
[29] OFPADRINO      :  331 -    109 - (   51 | CIRUJANO )
[30] OFMADRINA      :    8 -      7 - (    2 | CRIADA DEL GOVERNADOR )
[31] VECPADRINO     :  111 -     55 - (   18 | ALXIMIA )
[32] VECMADRINA     :  109 -     47 - (   20 | ALXIMIA )
[33] PARENTESCP     :  898 -     78 - (  224 | M )
[34] PARENTESCM     :  868 -    234 - (  227 | D )
[35] CONYGE1        :    0
[36] FEMATRI1       :    0
[37] ECIVCONY1      :    0
[38] NOTARIO1       :    0
[39] OBSERVACN      : 1798 -   1425 - (   27 | (1) ABUELOS PATERNOS SE IGNORAN )
[40] APELLMADRINA   : 3795 -    477 - (  295 | RODRIGUEZ )
[41] APELLPADRINO   : 3716 -    482 - (  424 | RODRIGUEZ )

The births CSV file contains 5237 entries with 42 columns. The number of entries for each column varies widely and 12 contain no data at all. We can already see that Rodriguez is the most common surname and Manuel the most common name. Looking at "[20] NOMMADRE" we can see that the most common name for mothers, hence women, is Maria.

Other interesting data, albeit expected, is that the most common job "[25] OFICIOPAD" is "LABRADOR" (english: farmer) while the most common job for the padrino (Godfather) is CIRUJANO (surgeon). From this it is possible to infer that the Godfather role was most commonly taken by a higher social class person.

In addition there are 1798 comments "observaciones" with rich, yet unstructured, information, for example:

In [4]:
print( npBirthData[427,39] )
bautizé sup condicione un niño enmcontrado cuyos padres se ignoran el qual fue allado a la puerta de Felipe TORRES  a media noche

English: I baptised sup condicione a boy with unknown parents who was found at the door of Felipe Torres at midnight.

In [5]:
print( npBirthData[4013,39] )
Fue bautizada en el acto de la nacencia por su abuela Theodora Pitarque y murio immediatamente.
Examine la forma e intencion del bautismo y allí que estubo bien administrado

English: Was baptised at birth by her grandmother Theodora Pitarque and died immediately. I examined the intention and correctness of the baptism and found that it was well performed.

2.4 An overview of the funerals dataset

In [6]:
[headerDeath, npDeathData, NPDEATHDATA] = readDB ('data/allDefunciones.csv')

Number of Entries: 4330
Number of Columns:   28

                         # - Unique - Most common
[ 0] ALBACEDEF      :  471 -    401 - (   18 | EL RETOR )
[ 1] APELL1         : 3115 -    440 - (  244 | RODRIGUEZ )
[ 2] APELL2         : 1341 -    212 - (  128 | RODRIGUEZ )
[ 3] APELL2MAD      : 2246 -    316 - (  186 | RODRIGUEZ )
[ 4] APELL2PAD      : 2483 -    282 - (  222 | RODRIGUEZ )
[ 5] CONYGE1        : 1275 -    898 - (   17 | D )
[ 6] DEFUNCION      : 3704 -   3466 - (    4 | 5 6 1748 )
[ 7] ECIVCONY1      : 1193 -     13 - (  787 | ESP )
[ 8] DEFUNCION2     : 3703 -   3458 - (    4 | 25/3/1826 )
[ 9] edatTEXT       : 2649 -    176 - ( 1523 | P )
[10] ENTERRAMIE     : 3155 -   2856 - (   11 | 25 )
[11] ENTERRAMIENTO2 : 3155 -   2856 - (   11 | 25 )
[12] LIBRAS         :   31 -      9 - (    7 | 20 )
[13] NOMBRE         : 3978 -    742 - (  273 | MARIA )
[14] NOMMADRE       : 2429 -    246 - (  442 | MARIA )
[15] NOMPADRE       : 2673 -    205 - (  368 | MANUEL )
[16] NOTARIO1       :   40 -     29 - (    9 | LABRADOR )
[17] OBSERVACN      : 2407 -   1548 - (  131 | IG )
[18] OFICIO         :  334 -     78 - (   84 | LABRADOR )
[19] OFICIOMAD      :    3 -      3 - (    1 | AMA )
[20] OFICIOPAD      :  235 -     36 - (  109 | LABRADORES )
[21] ORIGPAD        :    9 -      6 - (    3 | ALGIMIA )
[22] SEXO           : 3923 -      4 - ( 2034 | HOMBRE )
[23] SUELDOS        :    4 -      3 - (    2 | 0 )
[24] TESTIGDEF      : 1312 -    439 - (  545 | EL PARROCO )
[25] TIPO           :  340 -     48 - (  262 | POBRE )
[26] VECINDAD       :  750 -     58 - (  621 | Ç )
[27] VECINDPAD      :   18 -      6 - (   11 | Ç )

Again, the most common surname is Rodriguez, given names are Manuel and Maria, and the most common job is "labrador". At this point we can see that the data is very incomplete with only 334 job entries out the 4330 subjects. However names of parents are plentiful (~2500, so we can hope to reconstruct some genealogy tree later on) and there are also observaciones (comments) in 2407 entries, which are quite interesting:

In [7]:
print( npDeathData[1051,17] )
IG
HF
No sacramentos: murió de repente sin poderlo remediar pues cayó en un lagar lleno de vino

English: IG, HF (Hijo de Familia, denotes a teenager), No Sacrament: Died suddenly without remedy because he fell in a wine press full of wine.

In [8]:
print( npDeathData[3794,17] )
Se encontro asesinado el dia 14 en la partida llamada mojonada de este termino natural y vecino de Segorbe

English: Was found murdered the on 14th in the "Mojonada" in this municipality. He was from Segore (Segorbe is the largest town in the region, 8Km away from Vall de Almonacid)

In [9]:
print( npDeathData[2825,17] )
De ALTURA que estaba en este pueblo por huir de los Franceses
Se le hizo entierro ordinario que pagaron sus hijos

English: From Altura (another village, 10 km away) who was in this village fleeing from the French. A regular funeral was performed paid by his children.

An yes, that was during the Spanish-French war (https://en.wikipedia.org/wiki/Peninsular_War):

In [10]:
print( npDeathData[2825,8] )
23/7/1811

In [11]:
print( npDeathData[524,17] )
“aviendo enfermado de enfermedad mortal, dicho dia al amanecer fue allada aogada en la balsa de la huerta mayor, y dudando el pueblo de la Vall si se le avia de dar sepultura o, no, cosa que al Pueblo no pertenecia sino al señor obispo de Segorbe y su vicario General, el señor don Chisostomo ROYO DE CASTELLVI, o RENS de la Sta Iglesia de Segorbe, su oficial general dicho dia mandaron que asta que se averiguare con testigos e informasemos si se avia desesperado, o, no, que la metiesen en una caxa hecha de tablas y fuese llevado el cadaver, asta que le recibieren los testº y se hicieron las debidas deligencias todo lño qual se hizo obedeciuendo al dicho Rereen dissimo Sr, fue llevado el cadaver a la partida llamada rincon del pico de espadan cerca de la raya del termino distanmcia un tiro de piedra en donde ay un secano cuyo dueño vivae en la alcudia. Despues se recibieron ofrendas de personas fidelkes y testificaron que el dia de S Pedro la avian visto confesar con el cura por espoacio de dos horas y el mismo cura certifica que se comparo generalmente  de toda su vida y aviendo tenido tar juntos de Teologos en diferentes dias se informaron del medico que la avia visitado en una enfermedad y dicho y testifico que su enfermedad le hallo en punto, que iba dedicando a delirio ultimamente dichos informaciones determino y mando a san Vicario general que se tocasen a muerto y se le dijera misa cantada de cuerpo presente y que el curo fuese personalmente  al pico de Espedan y que se le vendijese la sepultaura solemnemente y que estubiese el cadaver depritado asta tanto que qede convertido y despues que rezen tratidos los mesos la misma caxa el cementerio a Campo Santo del dixo lugar en donde esta enterrado los demas fieles Serian diez libras para que se celebre bien por su alma y que se le diga misas rezadas....
Esta enterrado el cadaver en el cementerio al pie de la cruz en 09.08.1682

English: Being mortally ill she was found drowned on the fields watering pond. The village doubted if she was to be buried on the graveyard or not (suicides would prevent people to get buried there) but the decision was to be taken by the bishop and the church. The general official of the bishop then instructed to put the body in a plank box and leave it to the "Rincon del pico espadan", at the edge of the municipality, until the witnesses could testify and the investigation was concluded. Witnesses testified that they had seen her confessing with the priest for 2 hours and the priest detailed that she had confesses about her entire life. Theologians consulted the doctor that had attended her and he explained that her illness induced her delirium. With this information the general vicar instructed to toll for the dead and that she was to have sang mass with the deceased's body present. In addition, once the body had been converted, the priest was to go personally to the "Rincon del pico espadan" and perform the funeral duties solemnly, accompanying the body until she was buried in the graveyard with all other Catholics.

3. Digging into the data

Now that we have an idea of what is in the data, let's see what can we learn from it.

3.1 The most common values for each field

To check the most common entries of each field I wrote the following function, which given a list of entries relies on the Counter structure to sort the items in the list and print them in order, with the number of occurrences in brackets:

In [12]:
def printMostCommon( vec, n=float("inf") ):

    vec2 = vec [ vec != '' ]
    count = Counter(vec2)

    print ("Most common entries:")
    for idx in range(len(count)):
        if idx  > n-1:
            break
        print ( "%4d : %s" % (count.most_common()[idx][1], count.most_common()[idx][0]) )

A quick look to jobs show that something is amiss.

In [13]:
printMostCommon( NPDEATHDATA[:,18], 15 )
Most common entries:
  84 : LABRADOR
  37 : LABRADORA
  36 : HOSPITAL GENERAL
  21 : STO HOSP
  18 : HOSP GRAL VCIA
  16 : AMA
  10 : STO HOSPITAL
   6 : PASTOR
   6 : LABRADOR POBRE
   5 : LABRADORA POBRE
   4 : TEXEDOR
   4 : LABRADORES
   4 : HOSPITAL GENERAL DE VALENCIA
   3 : CIRUJANO
   3 : SASTRE

So followed by farmer, the most common job was... farmer (but the female version of it)!

Nevertheless something is amiss here, as the 3rd, 4th, and 5th most common jobs are a hospital. The most plausible explanation though is that these were kids that had been taken from a hospital (couples who couldn't have kids, or only had girls, usually did that back in the day) and died young, and the priest (who wrote these records) opted to mention that in the job field.

Notice that beside farmer, another "popular" job was "LABRADOR POBRE" (Poor farmer).

In [14]:
printMostCommon( NPDEATHDATA[:,13], 15 )
Most common entries:
 273 : MARIA
 271 : MANUEL
 183 : FRANCISCO
 148 : JOSEPH
 148 : MIGUEL
 127 : MANUELA
 124 : JOSEF
 120 : VICENTE
 111 : ROSA
  80 : JUAN
  79 : FRANCISCA
  78 : VICENTA
  67 : TERESA
  64 : MARIA ROSA
  51 : ANTONIO

Looking at the 15 most common names we can see Joseph and Josef add to 272 occurrences, overtaking Manuel as the most common name. And that is without counting all the Jose and compound names so popular in Spain.

3.2 Delay between death and funeral

It is common in Spain to bury the deceased in the few immediate days after death. That contrasts with other Catholic countries, like Ireland, where burials can happen up to a week later than the death.

We can check the distance between death and burial, as we have these two dates in NPDEATHDATA[:,8] and NPDEATHDATA[:,11]. For this we have to use the try instruction to avoid getting errors when there is no date string or this is uncomplete.

In [15]:
diffBurial = []
for ii, defuncion in enumerate(NPDEATHDATA[:,8]):
    if (defuncion != '') and (NPDEATHDATA[ii,11] != ''):
        try:
            deathDate  = datetime.datetime.strptime(defuncion,     '%d/%m/%Y').date()
            burialDate = datetime.datetime.strptime(NPDEATHDATA[ii,11], '%d/%m/%Y').date()
            diffBurial.append( (burialDate - deathDate).days )
        except: 
            pass

delay = Counter(diffBurial)
print ( "\nDelay between death and funeral for %d funerals:" % len(diffBurial))
for idx in range(len(delay)):
    print ( "%4d : %d dias" % (delay.most_common()[idx][1], delay.most_common()[idx][0]) )

Delay between death and funeral for 2349 funerals:
2283 : 1 dias
  25 : 0 dias
  13 : 2 dias
   4 : -1 dias
   3 : -30 dias
   2 : 367 dias
   1 : 3 dias
   1 : 4 dias
   1 : 5 dias
   1 : 7 dias
   1 : -89848 dias
   1 : 10 dias
   1 : -1460 dias
   1 : 335 dias
   1 : -364 dias
   1 : 21 dias
   1 : -96423 dias
   1 : 90 dias
   1 : 93 dias
   1 : 29 dias
   1 : 366 dias
   1 : 8 dias
   1 : -81479 dias
   1 : 381 dias
   1 : -2 dias

From 2349 funerals, 2283 (97.2%) happened the day immediately after the death, 25 the same day, 13 after 2 days, and a 7 between 3 and 21 days.

Keep in mind, however, that for the 4330 entreis, there is no recording of the death date in around 2000. Probably because dead and funeral happened at the same time and the priests opted for only filling one field in the book.

In addition it can clearly be seen that there are numerous typos on the data, as 12 entries report a funeral before the death. In 12 other cases it looks like there is shift likely to be due to a typo (1 month, 3 months, 1 year, etc).

3.3 Distribution of events in time

Next step is to visualise the distribution of births and deaths by year, and also by month. For that I wrote these 3 functions:

In [16]:
# Hide the right and top axes
def remove2axis(ax):
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)

    # Only show ticks on the left and bottom spines
    ax.yaxis.set_ticks_position('left')
    ax.xaxis.set_ticks_position('bottom')
    
# Modified z-score
def MADscore(vec):
    median = np.median(vec)
    median_absolute_deviation = np.median([np.abs(v - median) for v in vec])
    modified_z_scores = [0.6745 * (v - median) / median_absolute_deviation for v in vec]
    return modified_z_scores

# Print and plot distributions
def plotDateDistribution ( vec, str_ylabel, showPlot=True, showByYear=True, showByDay=False ):

#        str_day = ['Domingo', 'Lunes', 'Martes', 'Miercoles', 'Jueves', 'Viernes', 'Sabado']
#        str_months = ['Enero', 'Febrero', 'Marzo', 'Abril', 'Mayo', 'Junio',
#                    'Julio', 'Agosto', 'Setiembre', 'Octubre', 'Noviembre', 'Diciembre']
        
        str_day = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
        str_months = ['January', 'February', 'March', 'April', 'May', 'June',
                    'July', 'August', 'September', 'October', 'November', 'December']

        distributionDay   = [0] * 7
        distributionMonth = [0] * 12
        distributionYear  = [0] * 2000

        # put entries in bin for each occurance
        for d in vec:
            try:
                deathDate = datetime.datetime.strptime(d, '%d/%m/%Y').date()
                distributionDay[deathDate.weekday()] += 1
                distributionMonth[deathDate.month-1] += 1
                distributionYear[deathDate.year]     += 1
            except: 
                pass

        # compute z-scores to detect outliers
        zScoreDay   = MADscore(distributionDay)
        zScoreMonth = MADscore(distributionMonth)
        zScoreYear  = MADscore(distributionYear)

        # find first and last year with data 
        min_year = next(i for i, v in enumerate(distributionYear) if v > 0)
        max_year_tmp = next(i for i, v in enumerate(reversed(distributionYear)) if v > 0)
        max_year = len(distributionYear) - 1 - max_year_tmp
        yearRange = np.arange(min_year, max_year+1)

        # Print distributions on console
        if showByDay:
            print( "\nBy day: ")
            for i, v in enumerate(distributionDay):
                print ( "%12s : %4d ( %+.2f )" % (str_day[i], v, zScoreDay[i]) )

        print( "\nBy month: ")
        for i, v in enumerate(distributionMonth):
            print ( "%12s : %4d ( %+.2f )" % (str_months[i], v, zScoreMonth[i]) )

        if showPlot:
            fig = plt.figure()
            fig.suptitle("By Month")
            ax1 = plt.gca()
            ax1.bar(range(12), distributionMonth, align='center', linewidth = '0', alpha=0.8, color='darkorange')
            ax1.set_xticks( range(12) )
            ax1.set_xticklabels( [v[0] for v in str_months] )
            ax1.set_ylabel( str_ylabel )
            plt.margins(0.05, 0)
            x1,x2,y1,y2 = ax1.axis()
            ax1.axis((x1,x2,y1,y2 + 5))
            remove2axis(ax1)

            if showByDay:
                fig = plt.figure()
                fig.suptitle("By Day")
                ax2 = plt.gca()
                ax2.bar(range(7), distributionDay, align='center', linewidth = '0', alpha=0.8, color='darkorange')
                ax2.set_xticks( range(7) )
                ax2.set_xticklabels( [v[0] for v in str_day] )
                plt.ylabel( str_ylabel )
                plt.margins(0.05, 0)
                x1,x2,y1,y2 = ax2.axis()
                ax2.axis((x1,x2,y1,y2 + 5))
                remove2axis(ax2)

            if showByYear:
                values = distributionYear[min_year:max_year+1]
                fig = plt.figure()
                fig.suptitle("By Year")
                ax3 = plt.gca()
                ax3.bar( yearRange, values, align='center', width=1, linewidth = '0', alpha=0.8, color='darkorange')
                ax3.set_xticks( [v for v in yearRange if v % 10 == 0] )
                plt.ylabel( str_ylabel )
                plt.margins(0.03, 0)
                x1,x2,y1,y2 = ax3.axis()
                ax3.axis((x1,x2,y1,y2 + 5))
                remove2axis(ax3)

#            fig.subplots_adjust(left=0.06, right=0.99, top= 0.99, bottom=0.05)
            plt.show()

        else:
            fig = None

        return [fig, distributionDay, distributionMonth, distributionYear]

Notice that this function also allows to explore the distributions of births and deaths across the days of the week. However I removed that part of the analysis as nothing interesting came out of it (all distributions were basically uniform)

3.3.1 When did births occur?

In [30]:
[figBirth, birthByDay, birthByMonth, birthByYear] = plotDateDistribution ( npBirthData[:,4], '# Births', True )

By month: 
     January :  447 ( +0.09 )
    February :  475 ( +0.65 )
       March :  570 ( +2.57 )
       April :  438 ( -0.09 )
         May :  513 ( +1.42 )
        June :  452 ( +0.19 )
        July :  462 ( +0.39 )
      August :  354 ( -1.78 )
   September :  336 ( -2.14 )
     October :  432 ( -0.21 )
    November :  408 ( -0.69 )
    December :  350 ( -1.86 )

The distribution by month shows a higher rate of births in the spring. March, the month with highest number of births has a modified z-score of 2.57. So a bit of an outlier but nothing exceptional (usually an outlier is considered as such when the z-score is over 3.5).

Despite not statistically significant, December shows a reducition in births, probably due to the reduced conception during Lent.

On the distribution by year we can see a steady increase in births along time, with a marked boom in the 1750 to 1770.

3.3.2 When did deaths occur?

In [18]:
[figDeath, deathByDay, deathByMonth, deathByYear] = plotDateDistribution ( npDeathData[:,8], '# Funerals', True )

By month: 
     January :  322 ( +0.78 )
    February :  236 ( -1.25 )
       March :  272 ( -0.40 )
       April :  293 ( +0.09 )
         May :  277 ( -0.28 )
        June :  259 ( -0.71 )
        July :  364 ( +1.77 )
      August :  486 ( +4.66 )
   September :  325 ( +0.85 )
     October :  285 ( -0.09 )
    November :  267 ( -0.52 )
    December :  316 ( +0.64 )

From the month distribution we see that August (A) has an abnormal spike of deaths with a modified z-score of +4.66 (Usually anything above 3.5 is considered outlier). In addition July and September are the 2nd and 3rd months with highest number of funerals.

At first we could think that given the high child mortality back then, a spike in deaths could be explained by a spike in births. However looking at the distribution of births previously plotted we see than that is not case, with August and Septembre being the months with lowest number of births. Could it be that summer and the spanish heat is worse for the population? Maybe it was due to accidents in the fields, wars being more common in the summer? We don't know, but we will come back to analysis in Section 3.4.

Distribution by year show no births from 1637 until 1663. This, I have been told, was due not because of a lack of births, but because the sheets from the records book of that period were tore. Probably to light a fire...

Similarly, in 1780 there are no deaths reported, but this is also probably due to missing records, as birth records exist therefore it is likely that the priest would have recorded funerals as well.

Another interesting fact is 1803, with that huge spike in deaths: Could it be wars, the plague, a bad crop? We don't know but we will take a look to that later on in Section 3.5, to see what the data says.

3.4 Trying to explain the spike of deaths in the summer

As seen in the distribution of deaths by month of the year, the summer months showed a higher mortality. This was a bit of a surprise for me, as I expected a nearly uniform distribution or an increase during winter, when the conditions are harsh (subzero temperatures are not rare during the winter months in this region).

In order to investigate any possible underlying cause that explains the super mortality spike, we first will separate adults from children:

In the field NPDEATHDATA[:,9] there is written an abbreviation of the age of people at their funerals. 'ALB' or 'A' (Albado/a), 'P' (Parvulo), 'N' (Niño/a), 'NIÑO' and 'NIÑA' denote children, while an empty field '' denote adults. HF (Hijo de familia) and the remaining indicators are excluded from this analysis.

In [19]:
printMostCommon( NPDEATHDATA[:,9], 15 )
Most common entries:
1523 : P
 391 : A
 158 : N
  85 : ALB
  57 : HF
  25 : RN
  23 : 2
  19 : 1
  16 : D
  11 : MY
   9 : NIÑA
   8 : 4
   7 : 18M
   7 : 73
   7 : NIÑO

In [20]:
# Separate adults and children
adults = []
children = []

for i, v in enumerate(NPDEATHDATA[:,9]):
    if v == '':
        adults.append(NPDEATHDATA[i,8])
    if v in ['P', 'ALB', 'A', 'N', 'NIÑA', 'ŃIÑO']:
        children.append(NPDEATHDATA[i,8])

print ("Total number of adult deaths: %d" % len(adults))

[figDeathAdults, deathByDayAdults, deathByMonthAdults, deathByYearAdults] = \
    plotDateDistribution ( adults, '# Funerals Adults', True )
Total number of adult deaths: 1681

By month: 
     January :  138 ( +1.00 )
    February :  121 ( -0.10 )
       March :  120 ( -0.16 )
       April :  141 ( +1.19 )
         May :  122 ( -0.03 )
        June :  111 ( -0.74 )
        July :  123 ( +0.03 )
      August :  138 ( +1.00 )
   September :  113 ( -0.61 )
     October :  142 ( +1.25 )
    November :  118 ( -0.29 )
    December :  135 ( +0.80 )

When only looking at adult deaths, the distribution per months seems pretty niform, without a summer spike as before.

In [21]:
print ("Total number of children deaths: %d" % len(children))

[figDethChildren, deathByDayChildren, deathByMonthChildren, deathByYearChildren] = \
    plotDateDistribution ( children, '# Funerals Children', True )
Total number of children deaths: 2166

By month: 
     January :  154 ( +1.10 )
    February :   91 ( -1.56 )
       March :  130 ( +0.08 )
       April :  118 ( -0.42 )
         May :  120 ( -0.34 )
        June :  126 ( -0.08 )
        July :  211 ( +3.50 )
      August :  316 ( +7.93 )
   September :  181 ( +2.23 )
     October :  121 ( -0.30 )
    November :  124 ( -0.17 )
    December :  150 ( +0.93 )

When we focus on children, the summer spike becomes increasingly pronounced (z-score +7.93). This should then exclude the explanation that summer deaths were related to an increase in labour activity or warfare, a task mainly performed by adults, or older children.

To exclude violent deaths as the cause of the summer spike in mortality, we can look at deaths separated by gender. A high rate of male deaths could indicate a spike in violent deaths, as men were more involved in warfare and farming labour:

In [22]:
# Spearate adults deaths by gender
male = []
female = []

for i, v in enumerate(NPDEATHDATA[:,9]):
    if v == '':
        if NPDEATHDATA[i,22] == 'HOMBRE':
            male.append(NPDEATHDATA[i,8])
        else:
            female.append(NPDEATHDATA[i,8])

print ("Total number of male adult deaths: %d" % len(male))
print ("Total number of female adult deaths: %d" % len(female))
Total number of male adult deaths: 796
Total number of female adult deaths: 885

And the ratio by month:

In [23]:
deathByMonthMaleAdults   = [0] * 12
deathByMonthFemaleAdults = [0] * 12

for m in male:
    try:
        deathDate = datetime.datetime.strptime(m, '%d/%m/%Y').date()
        deathByMonthMaleAdults[deathDate.month-1] += 1
    except: 
        pass

for f in female:
    try:
        deathDate = datetime.datetime.strptime(f, '%d/%m/%Y').date()
        deathByMonthFemaleAdults[deathDate.month-1] += 1
    except: 
        pass    
    
MFratio = np.array(deathByMonthMaleAdults) / (np.array(deathByMonthFemaleAdults) + np.array(deathByMonthMaleAdults)) * 100
MFscore = MADscore(MFratio)

print( "\nBy month: ")
for i, v in enumerate(MFratio):
    print ( "%12s : %4.1f ( %+.2f )" % (i, v, MFscore[i]) )

By month: 
           0 : 53.6 ( +1.84 )
           1 : 45.5 ( -0.35 )
           2 : 47.5 ( +0.20 )
           3 : 48.9 ( +0.58 )
           4 : 41.0 ( -1.55 )
           5 : 45.9 ( -0.22 )
           6 : 43.9 ( -0.77 )
           7 : 52.2 ( +1.45 )
           8 : 46.0 ( -0.20 )
           9 : 50.0 ( +0.87 )
          10 : 44.9 ( -0.49 )
          11 : 57.0 ( +2.75 )

In [24]:
fig = plt.figure()
plt.plot(range(12), MFratio, alpha=0.8, linewidth=4, color='darkorange')
plt.plot([-0, 11], [50, 50], linewidth=2, alpha=0.3, color='gray')
plt.ylabel( 'Percentage of deaths that were male' )
plt.xticks( range(12), ['J', 'F', 'M', 'A', 'M', 'J', 'J', 'A', 'S', 'O', 'N', 'D'] )
plt.margins(0.05)
remove2axis(plt.gca())

The rate of (adult) male deaths remains between 41% (May) to 57% (December), with an august rate of 52.2% in August. This constant ratio suggests that the increase in summer deaths were not violent in nature.

So while it is hard to extract conclusions from this limited dataset, the data shows that the summers had an increased mortality among children that was not related to violent deaths.

When we compare this data with other demographic studies we see that in "Evolucion Demografica de Cortes de Arenoso desde 1560 a 1660" by Antonio Poveda Ayora, MSc Thesis, Universidad de Valencia, 1982, a similar spike is reported in the month of August (with a 38% increase over the mean). In page 282 he reports:

"El máximo de defunciones que se registran en Cortes de Arenoso en los meses de agosto y septiembre puede deberse a la incidencia que, sobre estos meses, tendría la mortalidad de niños, pues las causas de la mortalidad infantil eran, fundamentalmente, las diarreas estivales antes de las lluvias de otoño, debidas a la sequedad, altas temperaturas, empleo de aguas contaminadas, etc."

English: "The maximum of deaths registered in Cortes de Arenoso in August and September can be due the high children's mortality, as main children mortality causes were summer diarrhea before autumn rains as consequence of dry weather, high temperatures, contaminated waters, etc."

3.5 Child mortality as an index of prosperity

We have seen that the period between 1750 and 1780 showed an increase in births that might be attributed to a higher standard of living. To confirm that we can take a look at child mortality, as there is an association between higher child mortality and poorer living conditions.

In [25]:
# we already have computed deathByYearChildren and deathByYearAdult
#ChildrenMortalityRatio = np.array(deathByYearChildren) / (np.array(deathByYearAdults) + np.array(deathByYearChildren)) * 100
#CRscore = MADscore(ChildrenMortalityRatio)

childrenMortalityRatio = [np.nan] * 2000
for i, v in enumerate(deathByYearChildren):
    if v > 0 or deathByYearAdults[i] > 0:
        childrenMortalityRatio[i] = deathByYearChildren[i] / (deathByYearChildren[i] + deathByYearAdults[i]) * 100

#print( childrenMortalityRatio )

# find first and last year with data 
min_year = next(i for i, v in enumerate(childrenMortalityRatio) if v != np.nan)
max_year_tmp = next(i for i, v in enumerate(reversed(childrenMortalityRatio)) if v != np.nan)
max_year = len(childrenMortalityRatio) - 1 - max_year_tmp
yearRange = np.arange(min_year, max_year+1)

fig = plt.figure()
fig.suptitle("Child mortality by year")
plt.bar(yearRange, childrenMortalityRatio[min_year:max_year+1], align='center', linewidth=0, width=1, color='darkorange', alpha=0.8)
#plt.plot([-0, 11], [50, 50])
plt.ylabel( 'Children mortality ratio' )

plt.xticks( [v for v in yearRange if v % 10 == 0] )
plt.margins(0.03, 0)
remove2axis(plt.gca())

It is clear from that graph that the expected decline in child mortality is not present in the period between 1750 to 1770. In addition we can see that there is virtually no child mortality between 1705 and 1750. This was clear not the case and the likely reason for that gap is that child mortality was not reported during this period.

Similarly, the 100% child mortality reported from 1840 to 1850 is probably a consequence of bad encoding of deaths of adults, or bad parsing of the data.

3.6 What about 1803?

As seen in the plot of all deaths by year,a huge spike in mortality occurs in 1803. To study the possible explanation of that spike we first look at the distributions by month on that year:

In [26]:
deaths1803 = []
deaths1803Adult = []
deaths1803Child = []

for i, d in enumerate(npDeathData[:,8]):
    try:
        deathDate = datetime.datetime.strptime(d, '%d/%m/%Y').date()
        if deathDate.year == 1803:
            deaths1803.append(d)
            if NPDEATHDATA[i,9] == '':
                deaths1803Adult.append(d)
            if NPDEATHDATA[i,9] in ['P', 'ALB', 'A', 'N', 'NIÑA', 'ŃIÑO']:
                deaths1803Child.append(d)
    except: 
        pass

[figDeath1803, deathByDay1803, deathByMonth1803, deathByYear1803] = plotDateDistribution ( deaths1803, '# Funerals in 1803', True, False )

By month: 
     January :   33 ( +9.44 )
    February :   10 ( +1.69 )
       March :    6 ( +0.34 )
       April :    5 ( +0.00 )
         May :    5 ( +0.00 )
        June :    1 ( -1.35 )
        July :    3 ( -0.67 )
      August :    7 ( +0.67 )
   September :    4 ( -0.34 )
     October :    5 ( +0.00 )
    November :    0 ( -1.69 )
    December :    3 ( -0.67 )

If we lat at the distribution independently for children and adults:

In [27]:
[figDeath1803, deathByDay1803, deathByMonth1803, deathByYear1803] = plotDateDistribution ( deaths1803Adult, '# Adult funerals in 1803', True, False )
[figDeath1803, deathByDay1803, deathByMonth1803, deathByYear1803] = plotDateDistribution ( deaths1803Child, '# Children funerals in 1803', True, False )

By month: 
     January :    1 ( +0.67 )
    February :    0 ( -0.67 )
       March :    4 ( +4.72 )
       April :    1 ( +0.67 )
         May :    0 ( -0.67 )
        June :    0 ( -0.67 )
        July :    0 ( -0.67 )
      August :    1 ( +0.67 )
   September :    0 ( -0.67 )
     October :    1 ( +0.67 )
    November :    0 ( -0.67 )
    December :    1 ( +0.67 )


By month: 
     January :   32 ( +9.44 )
    February :   10 ( +2.02 )
       March :    2 ( -0.67 )
       April :    4 ( +0.00 )
         May :    5 ( +0.34 )
        June :    1 ( -1.01 )
        July :    3 ( -0.34 )
      August :    6 ( +0.67 )
   September :    4 ( +0.00 )
     October :    4 ( +0.00 )
    November :    0 ( -1.35 )
    December :    2 ( -0.67 )

We can see that the exceptional number of 33 deaths (of which 32 where children) occured in the single month of January 1803. Was that a trend coming from 1802?

In [28]:
deaths1802 = []
for d in npDeathData[:,8]:
    try:
        deathDate = datetime.datetime.strptime(d, '%d/%m/%Y').date()
        if deathDate.year == 1802:
            deaths1802.append(d)
    except: 
        pass

[figDeath1802, deathByDay1802, deathByMonth1802, deathByYear1802] = plotDateDistribution ( deaths1802, '# Funerals 1802', True, False )

By month: 
     January :    4 ( +0.45 )
    February :    0 ( -1.35 )
       March :    5 ( +0.90 )
       April :    3 ( +0.00 )
         May :    5 ( +0.90 )
        June :    1 ( -0.90 )
        July :    1 ( -0.90 )
      August :    2 ( -0.45 )
   September :    2 ( -0.45 )
     October :    3 ( +0.00 )
    November :    3 ( +0.00 )
    December :    5 ( +0.90 )

No, January 1803 was an isolated case of high mortality. Maybe a bad crop, a strong winter, or a sever flu? We don't know.

Ot the spike is because somehow a bunnch of births were recorded with the same date repeated? Maybe the 1st of January of 1803?

In [29]:
# to dates type
dates1803 = [datetime.datetime.strptime(strDate, '%d/%m/%Y').date() for strDate in deaths1803]
dayOfMonth = [d.day for d in dates1803 if d.month == 1] # only for january
print(dayOfMonth)

# because who needs np.histogram()?
distDay = [0] * 32
for d in dayOfMonth:
    distDay[d] += 1

fig = plt.figure()
plt.bar(range(len(distDay)), distDay, alpha=0.8, linewidth=0, color='darkorange')
plt.ylabel( '# Funerals January 1803, per day' )
remove2axis(plt.gca())
[3, 4, 4, 4, 5, 5, 5, 8, 8, 10, 10, 11, 11, 13, 14, 14, 15, 15, 16, 16, 17, 19, 19, 23, 25, 25, 25, 25, 27, 28, 29, 31, 31]

No, the deaths are spread among the days of the month. January 1803 was simply a tough month.

4. Discussion and conclusions

While the data contains numerous inaccuracies and typographical errors, we can already see certain trends.

The first interesting result that arises from this study is that summers, specially August, showed a higher children mortality rate than the rest of the year. This trend is seen in other demographic studies of the region where it is attributed to the hotter conditions before autunm rains. In this study we showed that the reasons for such increase in mortality were probably not related to violent deaths, as male mortality (associated with warfare and labour accidents) remained under 50% during this period.

Furthermore, the months of Decembre and January showed a second spike in mortality, more pronounced among children andmales. This is also consistent with other studies of the region.

A second result of interest is that January of 1803 was a month with very high mortality among children, albeit we don't know the reason for that.

In addition, we can see an increase in number of births between 1750 and 1780. However we do not see a decrease in child mortality during the same period that would suggest an increase in standards of living during this period.

The data is available to be download as CSV files at:

http://perezrovira.net/adria/demografiaNotebook/allDefunciones.csv

http://perezrovira.net/adria/demografiaNotebook/allNacimientos.csv

Adria Perez-Rovira (v. 24/01/2017)

Copyright 2017 Adria Perez-Rovira (MIT License)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

In [29]: