# Exploratory analysis of demographic data (Vall de almonacid 1612 to 1852)¶

## 1. Introduction¶

This is an exploratory analysis of demographic data from Vall de Almonacid (https://commons.wikimedia.org/wiki/File:VGeneral_VallAlmonacid.jpg). Vall de Almonacid (Spain) is a small village on the Valencian region. Of Muslim origin, it was conquered, refounded and repopulated with Catholics from the Aragon Kingdom in 1238.

In the church there are several record books where births, marriages and funerals were recorded by the church prist. In the 1980s these records were taken by Josep Maria Perez Rodriguez who copied them by hand with the help of Jose Maria Perez Rodriguez, Maria Rovira Barbera, Pilar Perez Rodriguez, and some others.

The hand-copied files were later digitalised (by hand, again) and a copy of these files is the starting point of this analysis. Not all data was digitalised and many typos and errors are present in the dataset.

For this analysis, I opted for Python (v.3.5.2+) as a flexible yet user-friendly tool to explore data.

I have previosuly done some visualisation experiments with this data using d3.js that you can find at https://fuyas.github.io/nacimientos/ (in Spanish).

### 2.1 Imports and external dependencies¶

All modules are commonly used modules

In [1]:
# plot figures inline in the IPython notebook
%matplotlib inline

import csv
import numpy as np
import datetime
import matplotlib.pyplot as plt
from collections import Counter

# set width of figures by default
plt.rcParams['figure.figsize'] = (14.0, 6.0)


To import the data I created a small function that reads the CSV files (there are two, one for baptisms and one for funerals). The function returns the header (name of the fields stored on the first line of the CSV file), a numpy 2D array with all fields as strings, and a second copy with all fields capitalised. The second copy is used to compare strings, where capitalisation might differ due to the poor quality of data. In addition to read the data the function prints a list of all the fields, detailing the number of entries for each field and the value and frequency of the most common entry.

In [2]:
def readDB( str_file ):

file  = open(str_file, "r")

rownum = 0
allData = []
ALLDATA = []

if iRow == 0:

else:
allData.append( row )
ALLDATA.append( [x.upper() for x in row] )

file.close()

# put all data in  numpy
npData = np.array(allData)
NPDATA = np.array(ALLDATA)

print ( "\nNumber of Entries: %4d" % npData.shape[0] )
print ( "Number of Columns: %4d\n" % npData.shape[1] )

# Check number of entries per column and the most common value
print ( "                         # - Unique - Most common" )

vec  = NPDATA[:,idx]
vec2 = vec [ NPDATA[:,idx] != '' ]

if len(vec2) > 0:
count = Counter(vec2)
print ( "[%2s] %-15s: %4d - %6d - ( %4d | %s )" % (idx, header[idx], vec2.size, len(count), count.most_common(1)[0][1], count.most_common(1)[0][0]) )
else:
print ( "[%2s] %-15s: %4d" % (idx, header[idx], vec2.size) )



### 2.3 An overview of the births dataset¶

In [3]:
    [headerBirth, npBirthData, NPBIRTHDATA] = readDB ('data/allNacimientos.csv')


Number of Entries: 5237
Number of Columns:   42

# - Unique - Most common
[ 0] NOMBRE         : 5237 -   2442 - (  241 | MANUEL )
[ 1] APELL1         : 5221 -    461 - (  381 | RODRIGUEZ )
[ 2] APELL2         : 5217 -    569 - (  305 | RODRIGUEZ )
[ 3] SEXO           : 5237 -      2 - ( 2722 | HOMBRE )
[ 4] NACIMIENTO     : 5237 -   4967 - (    4 | 03/07/1839 )
[ 5] DIA            : 5237 -    366 - (   26 | 83 )
[ 6] YEAR           : 5237 -    216 - (   53 | 1831 )
[ 7] YEAR_ID        : 5237 -     53 - (  216 | 6 )
[ 8] DEFUNCION      :    0
[ 9] BAUTISMO       : 1117 -   1092 - (    3 | 04/07/39 )
[10] ENTERRAMIE     :    0
[11] TESTIGDEF      :    0
[12] ALBACEDEF      :    0
[13] ORIGEN         :    8 -      6 - (    3 | PARVULO )
[16] LIBRAS         :    0
[17] DINEROS        :    0
[18] NOMPADRE       : 5224 -    237 - (  702 | MANUEL )
[19] APELL2PAD      : 5222 -   1133 - (  167 | RODRIGUEZ )
[20] NOMMADRE       : 5210 -    282 - (  986 | MARIA )
[21] APELL2MAD      : 5217 -   1265 - (   96 | SANCHO )
[22] VECINDPAD      :    2 -      2 - (    1 | PASAGEROS POBRES )
[23] ORIGPAD        :   25 -     18 - (    4 | (1) )
[24] ORIGMAD        :   55 -     37 - (    4 | SEGORBE )
[25] OFICIOPAD      :  533 -     65 - (  381 | LABRADOR )
[27] NPADRINO       : 3720 -    248 - (  437 | MANUEL )
[28] NMADRINA       : 3794 -    309 - (  454 | MARIA )
[29] OFPADRINO      :  331 -    109 - (   51 | CIRUJANO )
[31] VECPADRINO     :  111 -     55 - (   18 | ALXIMIA )
[32] VECMADRINA     :  109 -     47 - (   20 | ALXIMIA )
[33] PARENTESCP     :  898 -     78 - (  224 | M )
[34] PARENTESCM     :  868 -    234 - (  227 | D )
[35] CONYGE1        :    0
[36] FEMATRI1       :    0
[37] ECIVCONY1      :    0
[38] NOTARIO1       :    0
[39] OBSERVACN      : 1798 -   1425 - (   27 | (1) ABUELOS PATERNOS SE IGNORAN )
[40] APELLMADRINA   : 3795 -    477 - (  295 | RODRIGUEZ )
[41] APELLPADRINO   : 3716 -    482 - (  424 | RODRIGUEZ )



The births CSV file contains 5237 entries with 42 columns. The number of entries for each column varies widely and 12 contain no data at all. We can already see that Rodriguez is the most common surname and Manuel the most common name. Looking at "[20] NOMMADRE" we can see that the most common name for mothers, hence women, is Maria.

Other interesting data, albeit expected, is that the most common job "[25] OFICIOPAD" is "LABRADOR" (english: farmer) while the most common job for the padrino (Godfather) is CIRUJANO (surgeon). From this it is possible to infer that the Godfather role was most commonly taken by a higher social class person.

In addition there are 1798 comments "observaciones" with rich, yet unstructured, information, for example:

In [4]:
print( npBirthData[427,39] )

bautizé sup condicione un niño enmcontrado cuyos padres se ignoran el qual fue allado a la puerta de Felipe TORRES  a media noche



English: I baptised sup condicione a boy with unknown parents who was found at the door of Felipe Torres at midnight.

In [5]:
print( npBirthData[4013,39] )

Fue bautizada en el acto de la nacencia por su abuela Theodora Pitarque y murio immediatamente.
Examine la forma e intencion del bautismo y allí que estubo bien administrado



English: Was baptised at birth by her grandmother Theodora Pitarque and died immediately. I examined the intention and correctness of the baptism and found that it was well performed.

### 2.4 An overview of the funerals dataset¶

In [6]:
[headerDeath, npDeathData, NPDEATHDATA] = readDB ('data/allDefunciones.csv')


Number of Entries: 4330
Number of Columns:   28

# - Unique - Most common
[ 0] ALBACEDEF      :  471 -    401 - (   18 | EL RETOR )
[ 1] APELL1         : 3115 -    440 - (  244 | RODRIGUEZ )
[ 2] APELL2         : 1341 -    212 - (  128 | RODRIGUEZ )
[ 3] APELL2MAD      : 2246 -    316 - (  186 | RODRIGUEZ )
[ 4] APELL2PAD      : 2483 -    282 - (  222 | RODRIGUEZ )
[ 5] CONYGE1        : 1275 -    898 - (   17 | D )
[ 6] DEFUNCION      : 3704 -   3466 - (    4 | 5 6 1748 )
[ 7] ECIVCONY1      : 1193 -     13 - (  787 | ESP )
[ 8] DEFUNCION2     : 3703 -   3458 - (    4 | 25/3/1826 )
[ 9] edatTEXT       : 2649 -    176 - ( 1523 | P )
[10] ENTERRAMIE     : 3155 -   2856 - (   11 | 25 )
[11] ENTERRAMIENTO2 : 3155 -   2856 - (   11 | 25 )
[12] LIBRAS         :   31 -      9 - (    7 | 20 )
[13] NOMBRE         : 3978 -    742 - (  273 | MARIA )
[14] NOMMADRE       : 2429 -    246 - (  442 | MARIA )
[15] NOMPADRE       : 2673 -    205 - (  368 | MANUEL )
[16] NOTARIO1       :   40 -     29 - (    9 | LABRADOR )
[17] OBSERVACN      : 2407 -   1548 - (  131 | IG )
[18] OFICIO         :  334 -     78 - (   84 | LABRADOR )
[19] OFICIOMAD      :    3 -      3 - (    1 | AMA )
[20] OFICIOPAD      :  235 -     36 - (  109 | LABRADORES )
[21] ORIGPAD        :    9 -      6 - (    3 | ALGIMIA )
[22] SEXO           : 3923 -      4 - ( 2034 | HOMBRE )
[23] SUELDOS        :    4 -      3 - (    2 | 0 )
[24] TESTIGDEF      : 1312 -    439 - (  545 | EL PARROCO )
[25] TIPO           :  340 -     48 - (  262 | POBRE )
[26] VECINDAD       :  750 -     58 - (  621 | Ç )
[27] VECINDPAD      :   18 -      6 - (   11 | Ç )



Again, the most common surname is Rodriguez, given names are Manuel and Maria, and the most common job is "labrador". At this point we can see that the data is very incomplete with only 334 job entries out the 4330 subjects. However names of parents are plentiful (~2500, so we can hope to reconstruct some genealogy tree later on) and there are also observaciones (comments) in 2407 entries, which are quite interesting:

In [7]:
print( npDeathData[1051,17] )

IG
HF
No sacramentos: murió de repente sin poderlo remediar pues cayó en un lagar lleno de vino



English: IG, HF (Hijo de Familia, denotes a teenager), No Sacrament: Died suddenly without remedy because he fell in a wine press full of wine.

In [8]:
print( npDeathData[3794,17] )

Se encontro asesinado el dia 14 en la partida llamada mojonada de este termino natural y vecino de Segorbe



English: Was found murdered the on 14th in the "Mojonada" in this municipality. He was from Segore (Segorbe is the largest town in the region, 8Km away from Vall de Almonacid)

In [9]:
print( npDeathData[2825,17] )

De ALTURA que estaba en este pueblo por huir de los Franceses
Se le hizo entierro ordinario que pagaron sus hijos



English: From Altura (another village, 10 km away) who was in this village fleeing from the French. A regular funeral was performed paid by his children.

An yes, that was during the Spanish-French war (https://en.wikipedia.org/wiki/Peninsular_War):

In [10]:
print( npDeathData[2825,8] )

23/7/1811


In [11]:
print( npDeathData[524,17] )

“aviendo enfermado de enfermedad mortal, dicho dia al amanecer fue allada aogada en la balsa de la huerta mayor, y dudando el pueblo de la Vall si se le avia de dar sepultura o, no, cosa que al Pueblo no pertenecia sino al señor obispo de Segorbe y su vicario General, el señor don Chisostomo ROYO DE CASTELLVI, o RENS de la Sta Iglesia de Segorbe, su oficial general dicho dia mandaron que asta que se averiguare con testigos e informasemos si se avia desesperado, o, no, que la metiesen en una caxa hecha de tablas y fuese llevado el cadaver, asta que le recibieren los testº y se hicieron las debidas deligencias todo lño qual se hizo obedeciuendo al dicho Rereen dissimo Sr, fue llevado el cadaver a la partida llamada rincon del pico de espadan cerca de la raya del termino distanmcia un tiro de piedra en donde ay un secano cuyo dueño vivae en la alcudia. Despues se recibieron ofrendas de personas fidelkes y testificaron que el dia de S Pedro la avian visto confesar con el cura por espoacio de dos horas y el mismo cura certifica que se comparo generalmente  de toda su vida y aviendo tenido tar juntos de Teologos en diferentes dias se informaron del medico que la avia visitado en una enfermedad y dicho y testifico que su enfermedad le hallo en punto, que iba dedicando a delirio ultimamente dichos informaciones determino y mando a san Vicario general que se tocasen a muerto y se le dijera misa cantada de cuerpo presente y que el curo fuese personalmente  al pico de Espedan y que se le vendijese la sepultaura solemnemente y que estubiese el cadaver depritado asta tanto que qede convertido y despues que rezen tratidos los mesos la misma caxa el cementerio a Campo Santo del dixo lugar en donde esta enterrado los demas fieles Serian diez libras para que se celebre bien por su alma y que se le diga misas rezadas....
Esta enterrado el cadaver en el cementerio al pie de la cruz en 09.08.1682



English: Being mortally ill she was found drowned on the fields watering pond. The village doubted if she was to be buried on the graveyard or not (suicides would prevent people to get buried there) but the decision was to be taken by the bishop and the church. The general official of the bishop then instructed to put the body in a plank box and leave it to the "Rincon del pico espadan", at the edge of the municipality, until the witnesses could testify and the investigation was concluded. Witnesses testified that they had seen her confessing with the priest for 2 hours and the priest detailed that she had confesses about her entire life. Theologians consulted the doctor that had attended her and he explained that her illness induced her delirium. With this information the general vicar instructed to toll for the dead and that she was to have sang mass with the deceased's body present. In addition, once the body had been converted, the priest was to go personally to the "Rincon del pico espadan" and perform the funeral duties solemnly, accompanying the body until she was buried in the graveyard with all other Catholics.

## 3. Digging into the data¶

Now that we have an idea of what is in the data, let's see what can we learn from it.

### 3.1 The most common values for each field¶

To check the most common entries of each field I wrote the following function, which given a list of entries relies on the Counter structure to sort the items in the list and print them in order, with the number of occurrences in brackets:

In [12]:
def printMostCommon( vec, n=float("inf") ):

vec2 = vec [ vec != '' ]
count = Counter(vec2)

print ("Most common entries:")
for idx in range(len(count)):
if idx  > n-1:
break
print ( "%4d : %s" % (count.most_common()[idx][1], count.most_common()[idx][0]) )


A quick look to jobs show that something is amiss.

In [13]:
printMostCommon( NPDEATHDATA[:,18], 15 )

Most common entries:
36 : HOSPITAL GENERAL
21 : STO HOSP
18 : HOSP GRAL VCIA
16 : AMA
10 : STO HOSPITAL
6 : PASTOR
4 : TEXEDOR
4 : HOSPITAL GENERAL DE VALENCIA
3 : CIRUJANO
3 : SASTRE



So followed by farmer, the most common job was... farmer (but the female version of it)!

Nevertheless something is amiss here, as the 3rd, 4th, and 5th most common jobs are a hospital. The most plausible explanation though is that these were kids that had been taken from a hospital (couples who couldn't have kids, or only had girls, usually did that back in the day) and died young, and the priest (who wrote these records) opted to mention that in the job field.

Notice that beside farmer, another "popular" job was "LABRADOR POBRE" (Poor farmer).

In [14]:
printMostCommon( NPDEATHDATA[:,13], 15 )

Most common entries:
273 : MARIA
271 : MANUEL
183 : FRANCISCO
148 : JOSEPH
148 : MIGUEL
127 : MANUELA
124 : JOSEF
120 : VICENTE
111 : ROSA
80 : JUAN
79 : FRANCISCA
78 : VICENTA
67 : TERESA
64 : MARIA ROSA
51 : ANTONIO



Looking at the 15 most common names we can see Joseph and Josef add to 272 occurrences, overtaking Manuel as the most common name. And that is without counting all the Jose and compound names so popular in Spain.

### 3.2 Delay between death and funeral¶

It is common in Spain to bury the deceased in the few immediate days after death. That contrasts with other Catholic countries, like Ireland, where burials can happen up to a week later than the death.

We can check the distance between death and burial, as we have these two dates in NPDEATHDATA[:,8] and NPDEATHDATA[:,11]. For this we have to use the try instruction to avoid getting errors when there is no date string or this is uncomplete.

In [15]:
diffBurial = []
for ii, defuncion in enumerate(NPDEATHDATA[:,8]):
if (defuncion != '') and (NPDEATHDATA[ii,11] != ''):
try:
deathDate  = datetime.datetime.strptime(defuncion,     '%d/%m/%Y').date()
burialDate = datetime.datetime.strptime(NPDEATHDATA[ii,11], '%d/%m/%Y').date()
diffBurial.append( (burialDate - deathDate).days )
except:
pass

delay = Counter(diffBurial)
print ( "\nDelay between death and funeral for %d funerals:" % len(diffBurial))
for idx in range(len(delay)):
print ( "%4d : %d dias" % (delay.most_common()[idx][1], delay.most_common()[idx][0]) )


Delay between death and funeral for 2349 funerals:
2283 : 1 dias
25 : 0 dias
13 : 2 dias
4 : -1 dias
3 : -30 dias
2 : 367 dias
1 : 3 dias
1 : 4 dias
1 : 5 dias
1 : 7 dias
1 : -89848 dias
1 : 10 dias
1 : -1460 dias
1 : 335 dias
1 : -364 dias
1 : 21 dias
1 : -96423 dias
1 : 90 dias
1 : 93 dias
1 : 29 dias
1 : 366 dias
1 : 8 dias
1 : -81479 dias
1 : 381 dias
1 : -2 dias



From 2349 funerals, 2283 (97.2%) happened the day immediately after the death, 25 the same day, 13 after 2 days, and a 7 between 3 and 21 days.

Keep in mind, however, that for the 4330 entreis, there is no recording of the death date in around 2000. Probably because dead and funeral happened at the same time and the priests opted for only filling one field in the book.

In addition it can clearly be seen that there are numerous typos on the data, as 12 entries report a funeral before the death. In 12 other cases it looks like there is shift likely to be due to a typo (1 month, 3 months, 1 year, etc).

### 3.3 Distribution of events in time¶

Next step is to visualise the distribution of births and deaths by year, and also by month. For that I wrote these 3 functions:

In [16]:
# Hide the right and top axes
def remove2axis(ax):
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# Only show ticks on the left and bottom spines
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')

# Modified z-score
median = np.median(vec)
median_absolute_deviation = np.median([np.abs(v - median) for v in vec])
modified_z_scores = [0.6745 * (v - median) / median_absolute_deviation for v in vec]
return modified_z_scores

# Print and plot distributions
def plotDateDistribution ( vec, str_ylabel, showPlot=True, showByYear=True, showByDay=False ):

#        str_day = ['Domingo', 'Lunes', 'Martes', 'Miercoles', 'Jueves', 'Viernes', 'Sabado']
#        str_months = ['Enero', 'Febrero', 'Marzo', 'Abril', 'Mayo', 'Junio',
#                    'Julio', 'Agosto', 'Setiembre', 'Octubre', 'Noviembre', 'Diciembre']

str_day = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
str_months = ['January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December']

distributionDay   = [0] * 7
distributionMonth = [0] * 12
distributionYear  = [0] * 2000

# put entries in bin for each occurance
for d in vec:
try:
deathDate = datetime.datetime.strptime(d, '%d/%m/%Y').date()
distributionDay[deathDate.weekday()] += 1
distributionMonth[deathDate.month-1] += 1
distributionYear[deathDate.year]     += 1
except:
pass

# compute z-scores to detect outliers

# find first and last year with data
min_year = next(i for i, v in enumerate(distributionYear) if v > 0)
max_year_tmp = next(i for i, v in enumerate(reversed(distributionYear)) if v > 0)
max_year = len(distributionYear) - 1 - max_year_tmp
yearRange = np.arange(min_year, max_year+1)

# Print distributions on console
if showByDay:
print( "\nBy day: ")
for i, v in enumerate(distributionDay):
print ( "%12s : %4d ( %+.2f )" % (str_day[i], v, zScoreDay[i]) )

print( "\nBy month: ")
for i, v in enumerate(distributionMonth):
print ( "%12s : %4d ( %+.2f )" % (str_months[i], v, zScoreMonth[i]) )

if showPlot:
fig = plt.figure()
fig.suptitle("By Month")
ax1 = plt.gca()
ax1.bar(range(12), distributionMonth, align='center', linewidth = '0', alpha=0.8, color='darkorange')
ax1.set_xticks( range(12) )
ax1.set_xticklabels( [v[0] for v in str_months] )
ax1.set_ylabel( str_ylabel )
plt.margins(0.05, 0)
x1,x2,y1,y2 = ax1.axis()
ax1.axis((x1,x2,y1,y2 + 5))
remove2axis(ax1)

if showByDay:
fig = plt.figure()
fig.suptitle("By Day")
ax2 = plt.gca()
ax2.bar(range(7), distributionDay, align='center', linewidth = '0', alpha=0.8, color='darkorange')
ax2.set_xticks( range(7) )
ax2.set_xticklabels( [v[0] for v in str_day] )
plt.ylabel( str_ylabel )
plt.margins(0.05, 0)
x1,x2,y1,y2 = ax2.axis()
ax2.axis((x1,x2,y1,y2 + 5))
remove2axis(ax2)

if showByYear:
values = distributionYear[min_year:max_year+1]
fig = plt.figure()
fig.suptitle("By Year")
ax3 = plt.gca()
ax3.bar( yearRange, values, align='center', width=1, linewidth = '0', alpha=0.8, color='darkorange')
ax3.set_xticks( [v for v in yearRange if v % 10 == 0] )
plt.ylabel( str_ylabel )
plt.margins(0.03, 0)
x1,x2,y1,y2 = ax3.axis()
ax3.axis((x1,x2,y1,y2 + 5))
remove2axis(ax3)

#            fig.subplots_adjust(left=0.06, right=0.99, top= 0.99, bottom=0.05)
plt.show()

else:
fig = None

return [fig, distributionDay, distributionMonth, distributionYear]


Notice that this function also allows to explore the distributions of births and deaths across the days of the week. However I removed that part of the analysis as nothing interesting came out of it (all distributions were basically uniform)

#### 3.3.1 When did births occur?¶

In [30]:
[figBirth, birthByDay, birthByMonth, birthByYear] = plotDateDistribution ( npBirthData[:,4], '# Births', True )


By month:
January :  447 ( +0.09 )
February :  475 ( +0.65 )
March :  570 ( +2.57 )
April :  438 ( -0.09 )
May :  513 ( +1.42 )
June :  452 ( +0.19 )
July :  462 ( +0.39 )
August :  354 ( -1.78 )
September :  336 ( -2.14 )
October :  432 ( -0.21 )
November :  408 ( -0.69 )
December :  350 ( -1.86 )



The distribution by month shows a higher rate of births in the spring. March, the month with highest number of births has a modified z-score of 2.57. So a bit of an outlier but nothing exceptional (usually an outlier is considered as such when the z-score is over 3.5).

Despite not statistically significant, December shows a reducition in births, probably due to the reduced conception during Lent.

On the distribution by year we can see a steady increase in births along time, with a marked boom in the 1750 to 1770.

#### 3.3.2 When did deaths occur?¶

In [18]:
[figDeath, deathByDay, deathByMonth, deathByYear] = plotDateDistribution ( npDeathData[:,8], '# Funerals', True )


By month:
January :  322 ( +0.78 )
February :  236 ( -1.25 )
March :  272 ( -0.40 )
April :  293 ( +0.09 )
May :  277 ( -0.28 )
June :  259 ( -0.71 )
July :  364 ( +1.77 )
August :  486 ( +4.66 )
September :  325 ( +0.85 )
October :  285 ( -0.09 )
November :  267 ( -0.52 )
December :  316 ( +0.64 )



From the month distribution we see that August (A) has an abnormal spike of deaths with a modified z-score of +4.66 (Usually anything above 3.5 is considered outlier). In addition July and September are the 2nd and 3rd months with highest number of funerals.

At first we could think that given the high child mortality back then, a spike in deaths could be explained by a spike in births. However looking at the distribution of births previously plotted we see than that is not case, with August and Septembre being the months with lowest number of births. Could it be that summer and the spanish heat is worse for the population? Maybe it was due to accidents in the fields, wars being more common in the summer? We don't know, but we will come back to analysis in Section 3.4.

Distribution by year show no births from 1637 until 1663. This, I have been told, was due not because of a lack of births, but because the sheets from the records book of that period were tore. Probably to light a fire...

Similarly, in 1780 there are no deaths reported, but this is also probably due to missing records, as birth records exist therefore it is likely that the priest would have recorded funerals as well.

Another interesting fact is 1803, with that huge spike in deaths: Could it be wars, the plague, a bad crop? We don't know but we will take a look to that later on in Section 3.5, to see what the data says.

### 3.4 Trying to explain the spike of deaths in the summer¶

As seen in the distribution of deaths by month of the year, the summer months showed a higher mortality. This was a bit of a surprise for me, as I expected a nearly uniform distribution or an increase during winter, when the conditions are harsh (subzero temperatures are not rare during the winter months in this region).

In order to investigate any possible underlying cause that explains the super mortality spike, we first will separate adults from children:

In the field NPDEATHDATA[:,9] there is written an abbreviation of the age of people at their funerals. 'ALB' or 'A' (Albado/a), 'P' (Parvulo), 'N' (Niño/a), 'NIÑO' and 'NIÑA' denote children, while an empty field '' denote adults. HF (Hijo de familia) and the remaining indicators are excluded from this analysis.

In [19]:
printMostCommon( NPDEATHDATA[:,9], 15 )

Most common entries:
1523 : P
391 : A
158 : N
85 : ALB
57 : HF
25 : RN
23 : 2
19 : 1
16 : D
11 : MY
9 : NIÑA
8 : 4
7 : 18M
7 : 73
7 : NIÑO


In [20]:
# Separate adults and children
children = []

for i, v in enumerate(NPDEATHDATA[:,9]):
if v == '':
if v in ['P', 'ALB', 'A', 'N', 'NIÑA', 'ŃIÑO']:
children.append(NPDEATHDATA[i,8])


Total number of adult deaths: 1681

By month:
January :  138 ( +1.00 )
February :  121 ( -0.10 )
March :  120 ( -0.16 )
April :  141 ( +1.19 )
May :  122 ( -0.03 )
June :  111 ( -0.74 )
July :  123 ( +0.03 )
August :  138 ( +1.00 )
September :  113 ( -0.61 )
October :  142 ( +1.25 )
November :  118 ( -0.29 )
December :  135 ( +0.80 )



When only looking at adult deaths, the distribution per months seems pretty niform, without a summer spike as before.

In [21]:
print ("Total number of children deaths: %d" % len(children))

[figDethChildren, deathByDayChildren, deathByMonthChildren, deathByYearChildren] = \
plotDateDistribution ( children, '# Funerals Children', True )

Total number of children deaths: 2166

By month:
January :  154 ( +1.10 )
February :   91 ( -1.56 )
March :  130 ( +0.08 )
April :  118 ( -0.42 )
May :  120 ( -0.34 )
June :  126 ( -0.08 )
July :  211 ( +3.50 )
August :  316 ( +7.93 )
September :  181 ( +2.23 )
October :  121 ( -0.30 )
November :  124 ( -0.17 )
December :  150 ( +0.93 )



When we focus on children, the summer spike becomes increasingly pronounced (z-score +7.93). This should then exclude the explanation that summer deaths were related to an increase in labour activity or warfare, a task mainly performed by adults, or older children.

To exclude violent deaths as the cause of the summer spike in mortality, we can look at deaths separated by gender. A high rate of male deaths could indicate a spike in violent deaths, as men were more involved in warfare and farming labour:

In [22]:
# Spearate adults deaths by gender
male = []
female = []

for i, v in enumerate(NPDEATHDATA[:,9]):
if v == '':
if NPDEATHDATA[i,22] == 'HOMBRE':
male.append(NPDEATHDATA[i,8])
else:
female.append(NPDEATHDATA[i,8])

print ("Total number of male adult deaths: %d" % len(male))
print ("Total number of female adult deaths: %d" % len(female))

Total number of male adult deaths: 796
Total number of female adult deaths: 885



And the ratio by month:

In [23]:
deathByMonthMaleAdults   = [0] * 12

for m in male:
try:
deathDate = datetime.datetime.strptime(m, '%d/%m/%Y').date()
except:
pass

for f in female:
try:
deathDate = datetime.datetime.strptime(f, '%d/%m/%Y').date()
except:
pass

print( "\nBy month: ")
for i, v in enumerate(MFratio):
print ( "%12s : %4.1f ( %+.2f )" % (i, v, MFscore[i]) )


By month:
0 : 53.6 ( +1.84 )
1 : 45.5 ( -0.35 )
2 : 47.5 ( +0.20 )
3 : 48.9 ( +0.58 )
4 : 41.0 ( -1.55 )
5 : 45.9 ( -0.22 )
6 : 43.9 ( -0.77 )
7 : 52.2 ( +1.45 )
8 : 46.0 ( -0.20 )
9 : 50.0 ( +0.87 )
10 : 44.9 ( -0.49 )
11 : 57.0 ( +2.75 )


In [24]:
fig = plt.figure()
plt.plot(range(12), MFratio, alpha=0.8, linewidth=4, color='darkorange')
plt.plot([-0, 11], [50, 50], linewidth=2, alpha=0.3, color='gray')
plt.ylabel( 'Percentage of deaths that were male' )
plt.xticks( range(12), ['J', 'F', 'M', 'A', 'M', 'J', 'J', 'A', 'S', 'O', 'N', 'D'] )
plt.margins(0.05)
remove2axis(plt.gca())


The rate of (adult) male deaths remains between 41% (May) to 57% (December), with an august rate of 52.2% in August. This constant ratio suggests that the increase in summer deaths were not violent in nature.

So while it is hard to extract conclusions from this limited dataset, the data shows that the summers had an increased mortality among children that was not related to violent deaths.

When we compare this data with other demographic studies we see that in "Evolucion Demografica de Cortes de Arenoso desde 1560 a 1660" by Antonio Poveda Ayora, MSc Thesis, Universidad de Valencia, 1982, a similar spike is reported in the month of August (with a 38% increase over the mean). In page 282 he reports:

"El máximo de defunciones que se registran en Cortes de Arenoso en los meses de agosto y septiembre puede deberse a la incidencia que, sobre estos meses, tendría la mortalidad de niños, pues las causas de la mortalidad infantil eran, fundamentalmente, las diarreas estivales antes de las lluvias de otoño, debidas a la sequedad, altas temperaturas, empleo de aguas contaminadas, etc."

English: "The maximum of deaths registered in Cortes de Arenoso in August and September can be due the high children's mortality, as main children mortality causes were summer diarrhea before autumn rains as consequence of dry weather, high temperatures, contaminated waters, etc."

### 3.5 Child mortality as an index of prosperity¶

We have seen that the period between 1750 and 1780 showed an increase in births that might be attributed to a higher standard of living. To confirm that we can take a look at child mortality, as there is an association between higher child mortality and poorer living conditions.

In [25]:
# we already have computed deathByYearChildren and deathByYearAdult
#ChildrenMortalityRatio = np.array(deathByYearChildren) / (np.array(deathByYearAdults) + np.array(deathByYearChildren)) * 100

childrenMortalityRatio = [np.nan] * 2000
for i, v in enumerate(deathByYearChildren):
if v > 0 or deathByYearAdults[i] > 0:
childrenMortalityRatio[i] = deathByYearChildren[i] / (deathByYearChildren[i] + deathByYearAdults[i]) * 100

#print( childrenMortalityRatio )

# find first and last year with data
min_year = next(i for i, v in enumerate(childrenMortalityRatio) if v != np.nan)
max_year_tmp = next(i for i, v in enumerate(reversed(childrenMortalityRatio)) if v != np.nan)
max_year = len(childrenMortalityRatio) - 1 - max_year_tmp
yearRange = np.arange(min_year, max_year+1)

fig = plt.figure()
fig.suptitle("Child mortality by year")
plt.bar(yearRange, childrenMortalityRatio[min_year:max_year+1], align='center', linewidth=0, width=1, color='darkorange', alpha=0.8)
#plt.plot([-0, 11], [50, 50])
plt.ylabel( 'Children mortality ratio' )

plt.xticks( [v for v in yearRange if v % 10 == 0] )
plt.margins(0.03, 0)
remove2axis(plt.gca())


It is clear from that graph that the expected decline in child mortality is not present in the period between 1750 to 1770. In addition we can see that there is virtually no child mortality between 1705 and 1750. This was clear not the case and the likely reason for that gap is that child mortality was not reported during this period.

Similarly, the 100% child mortality reported from 1840 to 1850 is probably a consequence of bad encoding of deaths of adults, or bad parsing of the data.

As seen in the plot of all deaths by year,a huge spike in mortality occurs in 1803. To study the possible explanation of that spike we first look at the distributions by month on that year:

In [26]:
deaths1803 = []
deaths1803Child = []

for i, d in enumerate(npDeathData[:,8]):
try:
deathDate = datetime.datetime.strptime(d, '%d/%m/%Y').date()
if deathDate.year == 1803:
deaths1803.append(d)
if NPDEATHDATA[i,9] == '':
if NPDEATHDATA[i,9] in ['P', 'ALB', 'A', 'N', 'NIÑA', 'ŃIÑO']:
deaths1803Child.append(d)
except:
pass

[figDeath1803, deathByDay1803, deathByMonth1803, deathByYear1803] = plotDateDistribution ( deaths1803, '# Funerals in 1803', True, False )


By month:
January :   33 ( +9.44 )
February :   10 ( +1.69 )
March :    6 ( +0.34 )
April :    5 ( +0.00 )
May :    5 ( +0.00 )
June :    1 ( -1.35 )
July :    3 ( -0.67 )
August :    7 ( +0.67 )
September :    4 ( -0.34 )
October :    5 ( +0.00 )
November :    0 ( -1.69 )
December :    3 ( -0.67 )



If we lat at the distribution independently for children and adults:

In [27]:
[figDeath1803, deathByDay1803, deathByMonth1803, deathByYear1803] = plotDateDistribution ( deaths1803Adult, '# Adult funerals in 1803', True, False )
[figDeath1803, deathByDay1803, deathByMonth1803, deathByYear1803] = plotDateDistribution ( deaths1803Child, '# Children funerals in 1803', True, False )


By month:
January :    1 ( +0.67 )
February :    0 ( -0.67 )
March :    4 ( +4.72 )
April :    1 ( +0.67 )
May :    0 ( -0.67 )
June :    0 ( -0.67 )
July :    0 ( -0.67 )
August :    1 ( +0.67 )
September :    0 ( -0.67 )
October :    1 ( +0.67 )
November :    0 ( -0.67 )
December :    1 ( +0.67 )



By month:
January :   32 ( +9.44 )
February :   10 ( +2.02 )
March :    2 ( -0.67 )
April :    4 ( +0.00 )
May :    5 ( +0.34 )
June :    1 ( -1.01 )
July :    3 ( -0.34 )
August :    6 ( +0.67 )
September :    4 ( +0.00 )
October :    4 ( +0.00 )
November :    0 ( -1.35 )
December :    2 ( -0.67 )



We can see that the exceptional number of 33 deaths (of which 32 where children) occured in the single month of January 1803. Was that a trend coming from 1802?

In [28]:
deaths1802 = []
for d in npDeathData[:,8]:
try:
deathDate = datetime.datetime.strptime(d, '%d/%m/%Y').date()
if deathDate.year == 1802:
deaths1802.append(d)
except:
pass

[figDeath1802, deathByDay1802, deathByMonth1802, deathByYear1802] = plotDateDistribution ( deaths1802, '# Funerals 1802', True, False )


By month:
January :    4 ( +0.45 )
February :    0 ( -1.35 )
March :    5 ( +0.90 )
April :    3 ( +0.00 )
May :    5 ( +0.90 )
June :    1 ( -0.90 )
July :    1 ( -0.90 )
August :    2 ( -0.45 )
September :    2 ( -0.45 )
October :    3 ( +0.00 )
November :    3 ( +0.00 )
December :    5 ( +0.90 )



No, January 1803 was an isolated case of high mortality. Maybe a bad crop, a strong winter, or a sever flu? We don't know.

Ot the spike is because somehow a bunnch of births were recorded with the same date repeated? Maybe the 1st of January of 1803?

In [29]:
# to dates type
dates1803 = [datetime.datetime.strptime(strDate, '%d/%m/%Y').date() for strDate in deaths1803]
dayOfMonth = [d.day for d in dates1803 if d.month == 1] # only for january
print(dayOfMonth)

# because who needs np.histogram()?
distDay = [0] * 32
for d in dayOfMonth:
distDay[d] += 1

fig = plt.figure()
plt.bar(range(len(distDay)), distDay, alpha=0.8, linewidth=0, color='darkorange')
plt.ylabel( '# Funerals January 1803, per day' )
remove2axis(plt.gca())

[3, 4, 4, 4, 5, 5, 5, 8, 8, 10, 10, 11, 11, 13, 14, 14, 15, 15, 16, 16, 17, 19, 19, 23, 25, 25, 25, 25, 27, 28, 29, 31, 31]



No, the deaths are spread among the days of the month. January 1803 was simply a tough month.

## 4. Discussion and conclusions¶

While the data contains numerous inaccuracies and typographical errors, we can already see certain trends.

The first interesting result that arises from this study is that summers, specially August, showed a higher children mortality rate than the rest of the year. This trend is seen in other demographic studies of the region where it is attributed to the hotter conditions before autunm rains. In this study we showed that the reasons for such increase in mortality were probably not related to violent deaths, as male mortality (associated with warfare and labour accidents) remained under 50% during this period.

Furthermore, the months of Decembre and January showed a second spike in mortality, more pronounced among children andmales. This is also consistent with other studies of the region.

A second result of interest is that January of 1803 was a month with very high mortality among children, albeit we don't know the reason for that.

In addition, we can see an increase in number of births between 1750 and 1780. However we do not see a decrease in child mortality during the same period that would suggest an increase in standards of living during this period.